See also: large language model, few-shot learning, prompt engineering, transfer learning, fine-tuning
In-context learning (ICL) is the ability of a large language model to learn a new task at inference time by conditioning on a prompt that contains a few input-output examples (demonstrations), without updating any of the model's parameters. The term was introduced by Brown et al. in the GPT-3 paper "Language Models are Few-Shot Learners" (2020), where the authors showed that a sufficiently large autoregressive language model could perform a wide range of natural language processing tasks simply by being given a handful of examples in the prompt.
What makes in-context learning remarkable is that it requires no gradient descent, no backpropagation, and no changes to the model's weights. The model receives a sequence of demonstrations followed by a query input, and it produces the appropriate output by leveraging patterns it learned during pretraining. This stands in sharp contrast to the traditional machine learning paradigm of fine-tuning, where a pretrained model's weights are explicitly updated on task-specific data.
Since the GPT-3 paper, in-context learning has become one of the most widely studied phenomena in modern deep learning. Researchers have proposed multiple theoretical explanations for why it works, identified the factors that influence its effectiveness, and documented its limitations. This article covers the definition, mechanisms, theoretical frameworks, practical considerations, and open questions surrounding in-context learning.
In a standard in-context learning setup, the user constructs a prompt consisting of three components:

- An optional task instruction describing what the model should do.
- A small set of demonstrations, each pairing an example input with its desired output.
- The query input, for which the model must produce an output.
For example, a sentiment analysis prompt might look like this:
Review: "The food was excellent." Sentiment: Positive
Review: "Terrible service, never coming back." Sentiment: Negative
Review: "The ambiance was lovely but the pasta was overcooked." Sentiment:
The model processes this entire sequence and generates a completion for the final, incomplete example. Because the model's weights are frozen during this process, all of the "learning" happens through the forward pass of the transformer architecture. The attention mechanism allows the model to attend to the demonstration examples, identify the pattern connecting inputs to outputs, and apply that pattern to the query.
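As a concrete illustration, the sketch below assembles a few-shot prompt like the one above from a list of demonstrations; the helper name, data, and separator format are illustrative choices rather than any standard API.

```python
# Minimal sketch of few-shot prompt construction (names and format are illustrative).
demonstrations = [
    ("The food was excellent.", "Positive"),
    ("Terrible service, never coming back.", "Negative"),
]
query = "The ambiance was lovely but the pasta was overcooked."

def build_prompt(demos, query):
    """Concatenate demonstrations and the query into a single prompt string."""
    lines = [f'Review: "{text}" Sentiment: {label}' for text, label in demos]
    lines.append(f'Review: "{query}" Sentiment:')
    return "\n".join(lines)

prompt = build_prompt(demonstrations, query)
# The prompt is then sent to a frozen language model, which completes the
# final line with the predicted label (via any text-completion interface).
print(prompt)
```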
The self-attention layers in a transformer are central to in-context learning. Each attention head can, in principle, compare the query input against the demonstration examples and extract task-relevant information. As the prompt passes through multiple transformer layers, the model progressively refines its internal representation of the task.
Research by Olsson et al. (2022) identified specific attention heads called "induction heads" that play a mechanistic role in this process. An induction head detects a pattern of the form [A][B]...[A] in the input sequence and predicts that B should follow the second occurrence of A. While this is a simple pattern-completion algorithm, the authors present evidence that induction heads serve as a building block for the more general in-context learning ability observed in large transformers.
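A rough sketch of the pattern-completion rule an induction head implements, written on token strings and ignoring the attention machinery that realizes it inside the network, might look like this:

```python
def induction_completion(tokens):
    """Toy version of the induction-head rule: find the most recent earlier
    occurrence of the final token and predict the token that followed it."""
    last = tokens[-1]
    for i in range(len(tokens) - 3, -1, -1):   # earlier positions, most recent first
        if tokens[i] == last:
            return tokens[i + 1]               # [A][B] ... [A]  ->  predict [B]
    return None                                # the token has not occurred before

# "the cat sat . the cat" -> "sat"
print(induction_completion(["the", "cat", "sat", ".", "the", "cat"]))
```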
In-context learning is closely related to several prompting paradigms that differ primarily in the number of demonstration examples provided.
| Paradigm | Number of demonstrations | Description |
|---|---|---|
| Zero-shot | 0 | The model receives only a task instruction or query with no examples. It must rely entirely on knowledge acquired during pretraining. |
| One-shot | 1 | The model receives a single demonstration example before the query. |
| Few-shot | 2 to ~30 | The model receives a small set of demonstrations. This is the most common ICL setting. |
| Many-shot | Dozens to thousands | The model receives a large number of demonstrations, enabled by expanded context windows. |
Brown et al. (2020) evaluated GPT-3 across all of these settings and found that performance generally improved as more demonstrations were provided. They also found that the gap between zero-shot and few-shot performance widened with model scale: larger models benefited more from additional demonstrations.
Agarwal et al. (2024) from Google DeepMind extended this analysis to the many-shot regime, showing that models with long context windows (such as Gemini 1.5 Pro) can achieve significant performance gains when given hundreds or even thousands of in-context examples. Their work also introduced "Reinforced ICL," which replaces human-written demonstrations with model-generated chain-of-thought rationales, making many-shot ICL practical even when human-labeled examples are scarce.
One of the central questions in in-context learning research is: what computational process is the model actually performing when it learns from demonstrations? Several complementary theoretical frameworks have been proposed.
Multiple research groups have independently drawn connections between in-context learning and gradient-based optimization.
Dai et al. (2023) showed that the transformer attention mechanism has a mathematical "dual form" that resembles gradient descent. In their framework, the demonstration examples produce "meta-gradients" that are implicitly applied to the model's representations, effectively fine-tuning the model's behavior for the current task without actually modifying the stored weights. They provided empirical evidence that in-context learning and explicit fine-tuning produce similar behavioral patterns across multiple tasks.
Von Oswald et al. (2023) proved a stronger result for linear self-attention layers: a single linear self-attention layer can exactly replicate one step of gradient descent on a regression loss. They showed that trained transformers become "mesa-optimizers," meaning they learn to implement an optimization algorithm (gradient descent) within their forward pass. Their experiments demonstrated that transformers can even learn to apply curvature corrections analogous to second-order optimization methods, outperforming plain gradient descent on regression tasks.
Akyurek et al. (2023) investigated what specific learning algorithm in-context learning implements by studying transformers trained on linear regression tasks. They found that the in-context predictions of trained transformers closely match the predictions of gradient descent, ridge regression, and exact least-squares regression, depending on the model's depth and the noise level in the training data.
| Study | Key finding | Setting |
|---|---|---|
| Dai et al. (2023) | Attention has a dual form of gradient descent; ICL behaves like implicit fine-tuning | Real NLP tasks |
| Von Oswald et al. (2023) | Linear self-attention exactly implements gradient descent; transformers are mesa-optimizers | Linear regression |
| Akyurek et al. (2023) | ICL predictions match gradient descent, ridge regression, or least-squares depending on depth/noise | Linear models |
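The correspondence in the linear case can be checked directly. The toy numpy sketch below (sizes and the learning rate are arbitrary choices, and it demonstrates only the underlying algebraic identity, not Von Oswald et al.'s trained transformer) shows that the prediction after one gradient-descent step from zero weights equals a linear-attention-style weighted sum over the demonstration targets:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 32                       # input dimension, number of demonstrations (toy values)
w_true = rng.normal(size=d)        # ground-truth linear task
X = rng.normal(size=(n, d))        # demonstration inputs x_1..x_n
y = X @ w_true                     # demonstration targets y_i = <w_true, x_i>
x_q = rng.normal(size=d)           # query input
eta = 0.1                          # learning rate, arbitrary for the comparison

# One gradient-descent step on the squared loss, starting from w0 = 0,
# followed by a prediction on the query.
w0 = np.zeros(d)
grad = X.T @ (X @ w0 - y) / n
w1 = w0 - eta * grad
pred_gd = w1 @ x_q

# The same prediction written as a linear-attention readout: a sum over the
# demonstration targets weighted by inner products between each x_i and the query.
pred_attn = (eta / n) * np.sum(y * (X @ x_q))

print(pred_gd, pred_attn)          # identical up to floating-point error
```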
Xie et al. (2022) proposed an alternative theoretical framework, arguing that in-context learning is best understood as implicit Bayesian inference. Their central insight is that when pretraining data consists of documents drawn from a mixture of latent concepts (for example, different topics, writing styles, or domains), the language model learns to infer the latent concept underlying a given sequence. At inference time, the demonstration examples in a prompt provide evidence about the latent concept, and the model performs approximate Bayesian updating to narrow down which concept best explains the examples.
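Schematically, this view writes the model's prediction as a marginalization over latent concepts, with the prompt acting as evidence about which concept is in play (the notation below is an informal summary of the framework, not a formula from the paper):

$$
p(\text{output} \mid \text{prompt}) = \int_{\theta} p(\text{output} \mid \text{prompt}, \theta)\; p(\theta \mid \text{prompt})\, d\theta
$$

As demonstrations accumulate, the posterior $p(\theta \mid \text{prompt})$ concentrates on the concept that best explains them, so the prediction increasingly reflects the demonstrated task.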
To test this theory, Xie et al. constructed GINC (Generative In-Context learning dataset), a synthetic pretraining dataset with explicit latent concept structure based on mixtures of hidden Markov models. They proved theoretically that in-context learning emerges in this setting and showed empirically that both transformers and LSTMs pretrained on GINC exhibit in-context learning behavior. This work was published at ICLR 2022.
The Bayesian perspective and the gradient descent perspective are not mutually exclusive. Bayesian inference can be implemented through iterative optimization, and some researchers have argued that the two views describe the same underlying computation at different levels of abstraction.
Olsson et al. (2022), working at Anthropic, proposed a mechanistic explanation rooted in the internal circuitry of transformers. They identified "induction heads" as a specific type of attention head that implements a token-level pattern-completion algorithm: given a sequence of the form [A][B]...[A], an induction head predicts that B will follow.
The key evidence comes from training dynamics. The authors observed that induction heads emerge at a specific point during training, and this emergence coincides precisely with a sudden, sharp increase in the model's in-context learning ability (measured as decreasing loss at later positions in the sequence). This co-occurrence is visible as a distinctive "bump" in the training loss curve.
For small, attention-only transformer models, Olsson et al. presented strong causal evidence: ablating induction heads directly impairs in-context learning performance. For larger models with MLP layers, the evidence is correlational, since induction heads are more difficult to isolate in complex architectures. The authors hypothesize that induction heads may constitute the primary mechanism underlying in-context learning in large transformers, though this remains an area of active investigation.
Garg et al. (2022) took an empirical approach, training transformers from scratch on synthetic in-context learning tasks to study what function classes transformers can learn in-context. Their key findings include:

- Transformers can in-context learn linear functions with accuracy comparable to the optimal least-squares estimator.
- They can also in-context learn more complex function classes, including sparse linear functions, decision trees, and two-layer neural networks, in some cases matching or approaching specialized learning algorithms for those classes.
- This in-context learning ability shows some robustness to distribution shift between the prompts seen during training and those seen at evaluation time.
This work, published at NeurIPS 2022, helped establish that transformers can implement surprisingly powerful learning algorithms within their forward pass.
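As a rough sketch of the training setup used in this line of work (the dimensions and the linear function class below are illustrative choices), each synthetic prompt interleaves inputs with their function values, and the transformer is trained to predict each value from the preceding pairs in the same prompt:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_points = 8, 16                       # input dimension, pairs per prompt (toy values)

def sample_prompt():
    """One synthetic in-context regression prompt: (x_1, f(x_1), ..., x_n, f(x_n))."""
    w = rng.normal(size=d)                # a fresh random linear function per prompt
    xs = rng.normal(size=(n_points, d))
    ys = xs @ w
    return xs, ys                         # presented to the model as one interleaved sequence

xs, ys = sample_prompt()
# A transformer trained on many such prompts is then evaluated on how well it
# predicts f(x_i) given only the earlier (x, f(x)) pairs in the same prompt.
```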
A parallel line of research has investigated how the "knowledge" of a task is represented inside the model during in-context learning.
Hendel et al. (2023) discovered that in-context learning often compresses the information from all demonstration examples into a single vector in the model's residual stream, which they called a "task vector." This task vector acts as a compact representation of the task being performed. When extracted from one prompt and injected into a different forward pass (replacing the demonstrations), the task vector alone is sufficient to steer the model toward the correct task behavior. This work was published in the Findings of EMNLP 2023.
Independently and simultaneously, Todd et al. (2024) identified a closely related phenomenon they called "function vectors." Using causal mediation analysis, they showed that function vectors exist inherently within the transformer architecture and have strong causal effects on model outputs. When a function vector is added to the model's residual stream at inference time, it causes the model to apply the corresponding function, even without any demonstration examples in the prompt.
These findings suggest that in-context learning involves two stages: (1) the model reads the demonstrations and compresses them into an internal task representation, and (2) the model applies this task representation to the query input. This two-stage view aligns with both the Bayesian inference framework (where the task vector encodes the inferred latent concept) and the gradient descent framework (where the task vector encodes the implicit weight update).
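Hendel et al. and Todd et al. work with specific models, layers, and task suites; the sketch below only illustrates the general read-then-patch procedure using a small Hugging Face model, where the model name, layer index, and prompts are placeholder assumptions (there is no guarantee this particular combination steers gpt2 successfully):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # placeholder model choice
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()
LAYER = 6                                            # arbitrary block to read from / write to

demos = "France -> Paris\nJapan -> Tokyo\nItaly ->"  # demonstrations ending at a dummy query
query = "Spain ->"                                   # zero-shot prompt to be patched

# Stage 1: read the residual stream at the final demonstration token.
with torch.no_grad():
    out = model(**tok(demos, return_tensors="pt"), output_hidden_states=True)
task_vector = out.hidden_states[LAYER + 1][0, -1, :]  # hidden_states[i + 1] is block i's output

# Stage 2: add that vector to the query's residual stream at the same block.
def add_task_vector(module, inputs, output):
    hidden = output[0]
    hidden[:, -1, :] += task_vector
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_task_vector)
with torch.no_grad():
    patched = model(**tok(query, return_tensors="pt"))
handle.remove()

print(tok.decode(patched.logits[0, -1].argmax().item()))
```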
Model scale is one of the strongest predictors of in-context learning ability. Brown et al. (2020) observed that while zero-shot performance improves steadily with model size, few-shot performance increases more rapidly. This means that the benefit of providing demonstrations grows as models become larger. At the scale of GPT-3's 175 billion parameters, in-context learning becomes reliable across a broad range of tasks.
Wei et al. (2023) showed that model scale also affects how models use in-context demonstrations. Small models tend to rely on semantic priors from pretraining and largely ignore the input-label mappings in demonstrations. Large models, by contrast, can override their semantic priors when the demonstrations present conflicting information (for example, flipped labels where positive examples are labeled "Negative"). This ability to override priors is an emergent property that appears at sufficient scale.
The choice of which examples to include in the prompt significantly affects in-context learning performance.
Liu et al. (2022) proposed KATE (kNN-Augmented in-conText Example selection), a method that selects demonstration examples based on their semantic similarity to the query input in an embedding space. Their approach retrieves the nearest neighbors of the test input and uses them as demonstrations. This consistently outperformed random example selection across multiple NLU and NLG benchmarks.
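A minimal sketch of this kind of similarity-based retrieval is shown below, assuming a sentence-embedding encoder such as the one named here; the candidate pool, encoder choice, and value of k are illustrative, not the exact configuration used in the paper.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Candidate pool of labeled examples and a test input (illustrative data).
pool = [
    ("The food was excellent.", "Positive"),
    ("Terrible service, never coming back.", "Negative"),
    ("Great value for the price.", "Positive"),
    ("The room was dirty and noisy.", "Negative"),
]
query = "The staff were friendly but the soup was cold."
k = 2

encoder = SentenceTransformer("all-MiniLM-L6-v2")
pool_emb = encoder.encode([text for text, _ in pool])
query_emb = encoder.encode(query)

# Cosine similarity between the query and each candidate demonstration.
sims = pool_emb @ query_emb / (
    np.linalg.norm(pool_emb, axis=1) * np.linalg.norm(query_emb)
)
nearest = np.argsort(-sims)[:k]            # indices of the k most similar examples
demonstrations = [pool[i] for i in nearest]
```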
Other research has shown that selecting diverse examples that cover different aspects of the task can also improve performance. The optimal selection strategy depends on the task, the model, and the number of demonstrations available.
Lu et al. (2022) demonstrated that the order in which demonstration examples appear in the prompt can dramatically affect performance. In some cases, the difference between the best and worst orderings spans from near state-of-the-art accuracy to random-chance accuracy. This sensitivity exists across model sizes, meaning that even the largest models are not immune to it.
The authors proposed a method based on entropy statistics to identify high-performing orderings without access to a development set. Their approach uses the language model itself to generate a synthetic probing set, then ranks candidate orderings by entropy statistics computed over the model's predictions on that set, filtering out orderings whose predictions collapse onto a single label. This method achieved a 13% relative improvement for GPT-family models across eleven text classification tasks.
A surprising finding by Min et al. (2022) revealed that correct input-label mappings in demonstrations are not always necessary for in-context learning. Randomly replacing the labels in demonstration examples barely affected performance on a range of classification and multiple-choice tasks, consistently across 12 different models including GPT-3.
What does matter, according to their analysis, is that the demonstrations convey:

- The label space: the set of possible output labels for the task.
- The distribution of the input text: what kind of inputs the task involves.
- The overall format of the sequence: how inputs and outputs are paired and arranged in the prompt.
This result suggests that at least for certain task types, in-context learning relies more on task specification (telling the model what kind of task to perform) than on task learning (learning the input-output mapping from examples). However, Wei et al. (2023) later showed that larger models do learn from the actual input-label mappings, especially when the mappings contradict the model's semantic priors. So the importance of correct labels appears to increase with model scale.
Beyond example selection and ordering, the specific formatting of the prompt matters. Small changes to the template (such as the separator between input and output, the label words used, or the presence of a task instruction) can meaningfully affect performance. This sensitivity has motivated research on automatic prompt optimization and prompt tuning methods.
In-context learning and fine-tuning represent two fundamentally different approaches to adapting a pretrained model to a new task.
| Aspect | In-context learning | Fine-tuning |
|---|---|---|
| Parameter updates | None | Yes (gradient-based) |
| Training data needed | A few examples (fits in prompt) | Typically hundreds to thousands of examples |
| Computational cost | Single forward pass | Multiple epochs of training |
| Task switching | Instant (change the prompt) | Requires separate fine-tuned model per task |
| Performance ceiling | Generally lower than fine-tuning for specialized tasks | Higher with sufficient data |
| Risk of catastrophic forgetting | None (weights unchanged) | Yes, especially without careful regularization |
| Storage | Single model serves all tasks | Separate model (or adapter) per task |
| Requires labeled data | Minimally (a few examples) | Yes (labeled dataset) |
Dai et al. (2023) showed that in-context learning and fine-tuning produce similar internal representations and behavioral patterns, supporting the view that ICL is a form of implicit fine-tuning. However, fine-tuning generally achieves higher performance when sufficient labeled data is available, because it can make persistent changes to the model's weights rather than relying on limited prompt space.
In practice, the choice between ICL and fine-tuning depends on the use case. In-context learning is preferred when labeled data is scarce, rapid task switching is needed, or the cost of fine-tuning is prohibitive. Fine-tuning is preferred when maximum task performance is required and sufficient training data exists.
Traditional few-shot ICL is limited by the model's context window. Early models like GPT-3 had a context window of 2,048 or 4,096 tokens, constraining the number of demonstrations to a handful. As context windows have expanded to 128,000 tokens and beyond (for example, in GPT-4 Turbo, Claude, and Gemini 1.5 Pro), a new regime of "many-shot" in-context learning has become feasible.
Agarwal et al. (2024) systematically studied this regime and found:

- Scaling from few-shot to many-shot demonstrations yields large performance gains across a wide variety of generative and discriminative tasks.
- Unlike few-shot ICL, many-shot ICL can override pretraining biases, for example learning label mappings that contradict the model's semantic priors.
- Many-shot ICL can also handle tasks outside typical natural language settings, such as learning high-dimensional functions with numerical inputs.
The main bottleneck for many-shot ICL is the availability of human-labeled demonstrations. To address this, Agarwal et al. proposed Reinforced ICL (using model-generated chain-of-thought rationales as demonstrations) and Unsupervised ICL (using unlabeled examples). Both approaches proved effective, particularly on complex reasoning tasks.
Despite its practical utility, in-context learning has several well-documented limitations.
As discussed above, in-context learning is sensitive to the choice, ordering, and formatting of demonstration examples. This fragility means that small, seemingly inconsequential changes to the prompt can cause large swings in performance. Reliably optimizing prompts often requires experimentation or automated search methods.
Although context windows have grown substantially, they still impose an upper bound on the amount of information that can be provided as demonstrations. For tasks that require learning from large datasets, in-context learning cannot match the capacity of fine-tuning, which can iterate over arbitrary amounts of data through multiple epochs.
While in-context learning works well for pattern matching and classification tasks, it struggles with tasks that require multi-step reasoning or compositional generalization beyond the patterns in the demonstrations. Chain-of-thought prompting can partially address this limitation by providing intermediate reasoning steps in the demonstrations, but it does not fully close the gap.
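For instance, a chain-of-thought demonstration embeds the intermediate steps in the example output, so the model imitates the reasoning format as well as the final answer. The snippet below uses a widely cited example from the chain-of-thought prompting literature; the exact wording is illustrative.

```python
# One chain-of-thought demonstration followed by a new question (illustrative wording).
cot_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. How many does he have now?
A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. They used 20 and bought 6 more. How many apples do they have?
A:"""
# A capable model is expected to continue with a step-by-step rationale ending in "The answer is 9."
```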
In-context learning is ephemeral. Each new prompt starts from scratch, with no memory of previous interactions. The model cannot accumulate knowledge across sessions or build on previous in-context learning episodes. This is by design (the weights are frozen), but it means that in-context learning is not a substitute for training or fine-tuning when persistent adaptation is needed.
While the theoretical frameworks discussed in this article (gradient descent, Bayesian inference, induction heads) have provided valuable insights, none of them fully explains in-context learning in large, practical language models. Most theoretical results apply to simplified settings (linear models, small transformers, synthetic data), and the extent to which they generalize to models with hundreds of billions of parameters and MLP layers remains an open question.
Because in-context learning depends heavily on the provided demonstrations, biased or unrepresentative examples can lead the model to produce biased outputs. Unlike fine-tuning, where bias mitigation techniques can be applied during training, there is limited ability to control for bias in the in-context learning setting beyond careful curation of the demonstrations.
In-context learning has found widespread use across many areas of NLP and beyond.
| Application | How ICL is used |
|---|---|
| Text classification | Demonstrations show input texts paired with category labels |
| Machine translation | Source-target sentence pairs serve as demonstrations |
| Question answering | Question-answer pairs demonstrate the desired format and reasoning |
| Code generation | Input-output pairs or natural language descriptions paired with code |
| Summarization | Document-summary pairs establish the desired compression level and style |
| Data extraction | Examples show how to extract structured information from unstructured text |
| Reasoning tasks | Chain-of-thought demonstrations provide step-by-step reasoning templates |
| Format conversion | Examples demonstrate the mapping between data formats (e.g., JSON to CSV) |
Imagine you are doing a new kind of worksheet at school that you have never seen before. Your teacher shows you two completed examples at the top of the page so you can see the pattern. Then you try to do the next one on your own by copying what the examples did.
That is basically what in-context learning is. A big language model (like a very smart parrot that has read billions of sentences) gets shown a few "here is the question, here is the answer" examples right before a new question. The model looks at those examples, figures out the pattern, and then answers the new question the same way. The interesting part is that nobody had to re-teach the model anything. It just looked at the examples and figured it out on the spot.