Beam search is a search algorithm used in sequence generation tasks that explores multiple candidate sequences in parallel, retaining only the top-scoring hypotheses at each decoding step. Originally developed for general heuristic search in artificial intelligence, beam search became a cornerstone decoding method in natural language processing, machine translation, speech recognition, and image captioning throughout the 2010s and into the 2020s. While sampling-based methods have largely supplanted beam search for open-ended text generation in modern large language models, beam search remains widely used in tasks that demand high-fidelity, deterministic output.
At its heart, beam search is a breadth-limited tree search. Given a model that produces probability distributions over possible next tokens at each step, beam search maintains a fixed number of partially completed sequences (called "beams" or "hypotheses") and extends them one token at a time. At every step, the algorithm considers all possible one-token extensions of every active beam, scores them, and retains only the top-k candidates, where k is known as the beam width. This process repeats until all beams reach an end-of-sequence token or a maximum length.
Beam search can be understood as a middle ground between two extremes. On one side, exhaustive search would evaluate every possible output sequence, guaranteeing the globally optimal result but at a computational cost that grows exponentially with sequence length. On the other side, greedy decoding picks the single most probable token at each step, which is fast but frequently leads to suboptimal sequences because a locally high-probability token can steer the decoder toward a poor overall output. Beam search balances these tradeoffs by exploring multiple paths simultaneously without attempting to enumerate all of them [1].
The beam search algorithm proceeds through four clearly defined stages.

1. **Initialization.** The algorithm starts with a single beam containing only the start-of-sequence token (often denoted <sos> or <bos>). The cumulative log-probability of this initial beam is set to zero.
2. **Expansion.** At each time step t, the algorithm takes every active beam and computes the model's probability distribution over the full vocabulary for the next token. If the beam width is k and the vocabulary size is V, this produces up to k * V candidate extensions. Each candidate's score is the cumulative log-probability of the parent beam plus the log-probability of the new token.
3. **Pruning.** From all k * V candidates, the algorithm selects the top k sequences by score. All other candidates are discarded. This pruning step is what keeps the computation tractable: regardless of vocabulary size, only k beams survive each round.
4. **Completion and termination.** When a beam generates an end-of-sequence token (e.g., <eos>), that beam is considered complete and is moved to a finished list. The search continues for the remaining active beams. The algorithm terminates when all k beams have finished, when a maximum decoding length is reached, or when no active beams remain. From the finished list, the highest-scoring complete sequence is returned as the output.
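The stages above can be sketched in plain Python. This is a minimal illustration rather than a production decoder: `next_token_logprobs` stands in for a real model's per-step log-probability function, and the toy distribution at the bottom is invented for the example.

```python
from typing import Callable, Dict, List, Tuple

Beam = Tuple[Tuple[str, ...], float]  # (tokens, cumulative log-prob)

def beam_search(
    next_token_logprobs: Callable[[Tuple[str, ...]], Dict[str, float]],
    beam_width: int,
    max_len: int,
    bos: str = "<sos>",
    eos: str = "<eos>",
) -> Beam:
    beams: List[Beam] = [((bos,), 0.0)]   # initialization: one beam, score 0
    finished: List[Beam] = []
    for _ in range(max_len):
        candidates: List[Beam] = []
        for tokens, score in beams:       # expansion: score every one-token extension
            for tok, lp in next_token_logprobs(tokens).items():
                candidates.append((tokens + (tok,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []                        # pruning: keep only the top-k survivors
        for tokens, score in candidates:
            if tokens[-1] == eos:         # completion: set finished beams aside
                finished.append((tokens, score))
            elif len(beams) < beam_width:
                beams.append((tokens, score))
            if len(beams) == beam_width:
                break
        if not beams:
            break
    finished.extend(beams)                # beams still active at max length
    return max(finished, key=lambda c: c[1])

# Invented toy distribution: fixed next-token log-probs for every prefix.
def toy_model(tokens):
    if tokens[-1] == "<sos>":
        return {"A": -0.3, "B": -0.5, "C": -2.0}
    return {"A": -0.5, "B": -1.0, "<eos>": -0.7}

print(beam_search(toy_model, beam_width=2, max_len=5))
# (('<sos>', 'A', '<eos>'), -1.0)
```

Note that the highest-scoring finished sequence here is very short, which previews the length bias discussed later: without normalization, short sequences accumulate fewer negative log-probability terms.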
Consider a simple case with beam width k = 2 and a vocabulary of {A, B, C, <eos>}. At step 1, starting from <sos>, suppose the model assigns the highest probabilities to token A (log-prob -0.3) and token B (log-prob -0.5). These become the two active beams.
At step 2, the algorithm expands both beams. Beam [A] might generate candidates [A, A] with score -0.7, [A, B] with score -0.9, and [A, C] with score -1.2. Beam [B] might generate [B, A] with score -0.8, [B, C] with score -1.0, and [B, B] with score -1.5. From all six candidates, the top 2 are kept: [A, A] at -0.7 and [B, A] at -0.8. The process continues until completion.
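The step-2 selection amounts to nothing more than sorting the six candidate scores and keeping the best two:

```python
# Candidate extensions and cumulative log-prob scores from the worked example.
candidates = {
    ("A", "A"): -0.7, ("A", "B"): -0.9, ("A", "C"): -1.2,
    ("B", "A"): -0.8, ("B", "C"): -1.0, ("B", "B"): -1.5,
}
k = 2
top_k = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:k]
print(top_k)  # [(('A', 'A'), -0.7), (('B', 'A'), -0.8)]
```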
Beam width (also called beam size) is the single most important hyperparameter in beam search. It controls the tradeoff between search thoroughness and computational cost.
| Beam Width | Behavior | Typical Use |
|---|---|---|
| k = 1 | Equivalent to greedy decoding; only one hypothesis is maintained | Fast prototyping, latency-critical applications |
| k = 4-5 | Standard setting for machine translation benchmarks | Research benchmarks, production translation systems |
| k = 10-20 | Broader search; diminishing returns in many tasks | Speech recognition, some translation tasks |
| k = 50-100 | Very broad search; high computational cost | Specialized applications, reranking experiments |
| k = 200+ | Rarely used; often leads to degraded output quality | Research on beam search pathologies |
Increasing beam width generally improves output quality up to a point, but the relationship is not monotonic. Several studies have documented the "beam search curse" or "beam search degradation," where very large beam widths actually produce worse translations than moderate ones [2]. This occurs because the model's score function does not perfectly correlate with human quality judgments, and exhaustively optimizing the model's score can surface degenerate outputs (such as very short or repetitive sequences) that score highly under the model but are poor by human standards.
Computational cost scales linearly with beam width. Each additional beam requires a full forward pass through the model at every time step, so doubling the beam width approximately doubles the decoding time. Memory usage also scales linearly, as the algorithm must store the hidden states and partial sequences for all active beams.
Greedy decoding is the simplest possible decoding strategy: at each step, select the single token with the highest probability and continue. It is equivalent to beam search with beam width 1.
| Aspect | Greedy Decoding | Beam Search |
|---|---|---|
| Speed | Fastest (one forward pass per step) | k times slower than greedy |
| Output quality | Frequently suboptimal | Generally better for structured tasks |
| Determinism | Deterministic | Deterministic |
| Diversity | No diversity; single output | Low diversity across beams |
| Risk of local optima | High | Lower (explores k paths) |
The fundamental limitation of greedy decoding is that it can miss high-quality sequences that begin with a low-probability token. For example, in translation, the best overall rendering of a sentence might start with a word that is only the third most probable at position one but leads to a much better continuation. Greedy decoding would never discover this path, while beam search with k >= 3 would [3].
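A tiny invented example makes the failure mode concrete: the most probable first token leads to a weak continuation, while the second most probable first token leads to the best overall sequence. All probabilities below are made up for illustration.

```python
# Hypothetical two-step model: P(first token) and P(second token | first).
step1 = {"the": 0.5, "a": 0.3, "one": 0.2}
step2 = {
    "the": {"end": 0.4, "cat": 0.3, "dog": 0.3},
    "a":   {"end": 0.9, "cat": 0.05, "dog": 0.05},
    "one": {"end": 0.5, "cat": 0.25, "dog": 0.25},
}

# Greedy decoding commits to the argmax at each step.
first = max(step1, key=step1.get)                  # "the"
second = max(step2[first], key=step2[first].get)   # "end"
greedy_prob = step1[first] * step2[first][second]  # 0.5 * 0.4 = 0.2

# Exhaustive search over all six full sequences finds a better one.
best = max(
    ((w1, w2, step1[w1] * step2[w1][w2]) for w1 in step1 for w2 in step2[w1]),
    key=lambda t: t[2],
)
print(best[0])  # "a": its sequence scores 0.3 * 0.9 = 0.27 > 0.2
```

Greedy decoding never revisits its choice of "the", but beam search with k >= 2 keeps the "a" beam alive at step one and recovers the better sequence.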
That said, greedy decoding has seen a surprising resurgence in certain settings. Finbarr Timbers and others have observed that when large language models are sufficiently well-trained and the temperature is set appropriately, greedy decoding can produce outputs of comparable quality to beam search for many tasks, particularly when the model is confident in its predictions [4].
Sampling-based decoding methods introduce controlled randomness into the generation process, producing diverse and often more natural-sounding text. The main sampling approaches include:
Temperature sampling scales the logits before applying softmax, with higher temperatures flattening the distribution (more randomness) and lower temperatures sharpening it (approaching greedy behavior).
Top-k sampling restricts the sampling pool to the k most probable tokens at each step, zeroing out probabilities for all other tokens. This prevents sampling from the long tail of unlikely tokens.
Top-p sampling (nucleus sampling), introduced by Holtzman et al. in 2019, dynamically selects the smallest set of tokens whose cumulative probability exceeds a threshold p. This adapts the candidate pool size to the model's confidence at each step.
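The three mechanisms compose naturally in a single sampling step. The sketch below is a pure-Python illustration; the helper name `sample_next` and all numbers are invented for the example.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample one token index with temperature, top-k, and top-p filtering."""
    scaled = [l / temperature for l in logits]           # temperature scaling
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]                        # softmax
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    if top_k is not None:
        order = order[:top_k]                            # keep the k most probable
    if top_p is not None:                                # smallest nucleus with mass >= p
        kept, cum = [], 0.0
        for i in order:
            kept.append(i)
            cum += probs[i]
            if cum >= top_p:
                break
        order = kept
    total = sum(probs[i] for i in order)                 # renormalize over survivors
    r = rng.random() * total
    for i in order:
        r -= probs[i]
        if r <= 0:
            return i
    return order[-1]

# With top_k=1 this degenerates to greedy decoding, as the text notes.
print(sample_next([2.0, 1.0, 0.1], top_k=1))  # 0
```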
| Aspect | Beam Search | Sampling (top-k / top-p) |
|---|---|---|
| Determinism | Deterministic output | Stochastic; different each run |
| Diversity | Low; beams often near-identical | High; produces varied outputs |
| Repetition | Prone to repetitive text | Less repetitive with proper settings |
| Best for | Translation, summarization, ASR | Creative writing, dialogue, open-ended generation |
| Quality control | Optimizes model score directly | Controlled randomness; may occasionally produce incoherent text |
A key insight from Holtzman et al. (2019) is that human language does not follow the distribution of highest-probability next words. People regularly produce text that would not rank as the top prediction of a language model, and this mild unpredictability is what makes human writing engaging rather than bland. Beam search, by always selecting high-probability sequences, tends to produce text that is generic and repetitive, especially for open-ended generation tasks [5].
A well-known problem with standard beam search is its bias toward shorter sequences. Because the score of a sequence is the sum of log-probabilities (all negative or zero), longer sequences accumulate more negative terms and therefore have lower raw scores. Without correction, beam search systematically prefers short outputs.
Length normalization addresses this by dividing the sequence score by a function of its length. The most common approach, introduced in the Google Neural Machine Translation (GNMT) system by Wu et al. (2016), uses the formula:
score(Y) = log P(Y|X) / lp(Y)
where lp(Y) = ((5 + |Y|) / 6)^alpha is the length penalty term. The hyperparameter alpha controls the strength of the normalization: alpha = 0 applies no normalization, while alpha = 1 applies full normalization. Values around 0.6 to 0.8 are commonly used in practice [6].
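A quick numerical check shows how the penalty reverses the short-sequence bias. The raw scores and lengths below are invented for illustration:

```python
def gnmt_length_penalty(length, alpha):
    """lp(Y) = ((5 + |Y|) / 6) ** alpha, from Wu et al. (2016)."""
    return ((5 + length) / 6) ** alpha

# Hypothetical raw cumulative log-probs: the short hypothesis wins on raw score.
short_raw, short_len = -4.0, 5
long_raw, long_len = -5.5, 20
print(short_raw > long_raw)  # True

# After normalization with alpha = 0.7, the longer hypothesis wins instead.
alpha = 0.7
print(long_raw / gnmt_length_penalty(long_len, alpha)
      > short_raw / gnmt_length_penalty(short_len, alpha))  # True
```

Note that lp(1) = 1 for any alpha, so the penalty only starts to matter as hypotheses grow.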
Wu et al. (2016) also introduced a coverage penalty for machine translation, which encourages the decoder to attend to all parts of the source sentence. The coverage penalty sums the log of the minimum of the accumulated attention weights and 1.0 for each source position, penalizing hypotheses that leave parts of the input untranslated. Together with length normalization, the coverage penalty improved BLEU scores by approximately 1 point in the GNMT system [6].
Standard beam search tends to produce a set of hypotheses that are nearly identical, often differing by only one or two tokens. This lack of diversity is problematic for applications that need a ranked list of meaningfully different options.
Diverse Beam Search (DBS), proposed by Vijayakumar et al. (2016), addresses this by dividing the beam into groups and adding a diversity penalty between groups. At each step, the algorithm decodes one group at a time. When scoring candidates for group g, a penalty term is applied based on how similar those candidates are to the outputs already selected in groups 1 through g-1. This encourages each group to explore different regions of the output space [7].
The diversity penalty can be based on Hamming distance, n-gram overlap, or other similarity measures. DBS has been shown to improve not only the diversity of the output list but also the quality of the top-1 result in some settings, because the diversity constraint implicitly regularizes the search and prevents it from converging prematurely on a narrow part of the search space [7].
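The Hamming-distance variant of the penalty reduces to counting, at the current time step, how often each token was already chosen by earlier groups. A simplified sketch (the function name, penalty strength, and scores are illustrative, not from the paper):

```python
from collections import Counter

def diversity_adjusted_scores(candidate_scores, prior_group_tokens, strength=0.5):
    """Penalize candidate tokens by how often earlier groups already chose them
    at this time step (a Hamming-style diversity term, after Vijayakumar et al. 2016)."""
    counts = Counter(prior_group_tokens)
    return {tok: s - strength * counts[tok] for tok, s in candidate_scores.items()}

# Group 1 picked "the" at this step; group 2's score for "the" is pushed down,
# so group 2 explores a different region of the output space.
scores = {"the": -0.2, "a": -0.4, "this": -0.9}
adjusted = diversity_adjusted_scores(scores, prior_group_tokens=["the"])
print(max(adjusted, key=adjusted.get))  # a
```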
Other approaches to improving beam diversity include stochastic beam search, which draws beams without replacement using the Gumbel-top-k trick (Kool et al., 2019), and hybrid schemes that generate a candidate pool by sampling and then rerank it with the model's score.
Beam search has been the dominant decoding method in neural machine translation since the field's inception. The encoder-decoder architecture with attention, introduced by Bahdanau et al. (2014) and later reformulated as the Transformer by Vaswani et al. (2017), produces translations one token at a time, and beam search provides the mechanism for finding high-quality output sequences.
Production translation systems, including Google Translate and Microsoft Translator, have historically relied on beam search with beam widths typically between 4 and 12. The deterministic nature of beam search is an advantage in translation, where users expect consistent output for the same input. Length normalization and coverage penalties are standard enhancements in this setting [6].
In automatic speech recognition (ASR), beam search is used to decode the output of acoustic models. The decoder must find the most likely word sequence given the audio input, and beam search provides an efficient way to explore the space of possible transcriptions.
Classic ASR systems applied beam search within the Viterbi algorithm framework, limiting the number of active states at each time step. Modern end-to-end ASR models, such as those based on Connectionist Temporal Classification (CTC) or attention-based encoder-decoder architectures, use beam search at the token or word level. Beam widths in ASR tend to be larger than in text generation, often ranging from 10 to 100, because the acoustic signal introduces more ambiguity than text-conditioned generation [8].
Integration with external language models is common in ASR beam search. The decoder score is augmented with a weighted language model score at each step, helping to prefer linguistically plausible transcriptions even when the acoustic evidence is ambiguous.
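This "shallow fusion" is just a weighted sum of per-step scores. A minimal sketch, with invented scores and a hypothetical weight (in practice the LM weight and any word-insertion bonus are tuned on a development set):

```python
import math

def fused_score(acoustic_logprob, lm_logprob, lm_weight=0.3, word_bonus=0.0):
    """Combine the decoder's score with an external LM score at each beam step."""
    return acoustic_logprob + lm_weight * lm_logprob + word_bonus

# Acoustically near-identical pair, e.g. "their" vs "there": the LM breaks the tie.
a = fused_score(-1.20, math.log(0.02))  # "their" is implausible in this context
b = fused_score(-1.21, math.log(0.20))  # "there" is far more plausible
print(b > a)  # True: the fused score prefers the linguistically plausible word
```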
Beam search is widely used in image captioning models, where a visual encoder (typically a convolutional neural network or vision transformer) produces a representation of an image, and an autoregressive decoder generates a textual caption. Beam search helps produce grammatically correct and content-appropriate captions.
However, beam search in image captioning has been criticized for producing generic and repetitive captions. Because the model tends to assign high probability to safe, common descriptions (e.g., "a dog sitting on the grass"), beam search reinforces this tendency. Recent work on meshed context-aware beam search (MCBS) addresses this by dynamically adjusting the attention mechanism at each decoding step to focus on different image regions, producing more diverse and contextually appropriate captions [9].
In AI code generation, beam search has been used to produce syntactically valid programs by maintaining multiple candidate completions. The structured nature of programming languages makes beam search particularly useful, as invalid syntax can be detected and pruned early. Some code generation systems combine beam search with constraint decoding, where only syntactically valid token extensions are considered at each step.
The shift from beam search to sampling-based methods in modern large language models reflects several converging factors.
First, the primary use case for LLMs has shifted. Machine translation and summarization, where beam search excels, are no longer the dominant applications. Conversational AI, creative writing, brainstorming, and open-ended question answering require output that is varied, engaging, and natural-sounding. Beam search's tendency to produce the single most probable sequence works against these goals [5].
Second, modern LLMs are substantially larger and better trained than the models of the mid-2010s. With billions of parameters trained on trillions of tokens, these models have much sharper probability distributions over plausible continuations. The gap between greedy/beam search and sampling narrows as model quality improves, because the model's top predictions are already good.
Third, beam search is computationally expensive at inference time. For a model with hundreds of billions of parameters, each forward pass is costly, and maintaining k parallel beams multiplies that cost. Sampling with temperature, top-k, or top-p requires only a single forward pass per token, making it much more efficient for serving millions of users.
Fourth, beam search is deterministic, which is a liability for conversational applications. Users expect different responses to the same prompt across sessions, and developers want to be able to control the randomness of outputs. Sampling provides this control natively through the temperature parameter.
Fifth, research has shown that maximizing the probability of the output sequence (which is what beam search does) is not the same as maximizing output quality as judged by humans. The "likelihood trap" leads beam search to produce outputs that are probable but not necessarily good, a problem that worsens with very large beam widths [2].
As a result, the default decoding strategy for models like GPT-4, Claude, Gemini, and LLaMA is temperature-controlled sampling with top-p or top-k filtering, not beam search.
Researchers have identified several well-documented pathologies of beam search.
Cohen and Beck (2019) demonstrated that in neural machine translation, increasing beam width beyond a moderate value (around 5-10) often degrades translation quality as measured by BLEU score. This counterintuitive result occurs because the model's length bias, exposure bias, and other training artifacts become more pronounced as the search becomes more thorough. The "optimal" translation under the model's scoring function is not the translation that humans prefer [2].
Beam search frequently produces repetitive text, especially in language modeling (as opposed to translation). Sequences that contain repeating n-grams can accumulate high scores because the model learns that repeating a phrase is often locally probable. Various n-gram blocking heuristics have been proposed: for example, preventing any n-gram from appearing more than once in the output. While effective, these are patches rather than principled solutions.
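The standard n-gram blocking heuristic can be stated compactly: a token is banned if appending it to the current beam would repeat an n-gram already in that beam. A self-contained sketch of that check (the function name is illustrative):

```python
def banned_next_tokens(prefix, no_repeat_ngram_size):
    """Tokens that would complete an n-gram already present in `prefix`."""
    n = no_repeat_ngram_size
    if len(prefix) < n - 1:
        return set()
    banned = set()
    recent = tuple(prefix[-(n - 1):])        # the (n-1)-gram just generated
    for i in range(len(prefix) - n + 1):     # scan every earlier n-gram
        if tuple(prefix[i:i + n - 1]) == recent:
            banned.add(prefix[i + n - 1])    # its final token is now forbidden
    return banned

tokens = ["the", "cat", "sat", "on", "the", "cat"]
print(banned_next_tokens(tokens, 3))  # {'sat'}: "the cat sat" already occurred
```

During decoding, the banned tokens' log-probabilities are set to negative infinity before the pruning step, so no surviving beam ever repeats an n-gram.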
Without length normalization, beam search can produce very short or even empty outputs, because the <eos> token often has a relatively high probability early in decoding. The cumulative score of a short sequence can exceed that of a longer, better sequence simply because fewer negative log-probability terms have been accumulated.
Over the years, numerous variants of beam search have been proposed to address its limitations.
| Variant | Key Idea | Reference |
|---|---|---|
| Length-normalized beam search | Divides score by a function of length | Wu et al. (2016) |
| Diverse beam search | Adds inter-group diversity penalty | Vijayakumar et al. (2016) |
| Constrained beam search | Restricts outputs to satisfy lexical constraints | Anderson et al. (2017) |
| Stochastic beam search | Uses Gumbel-top-k for probabilistic beam selection | Kool et al. (2019) |
| Iterative beam search | Refines beams through multiple passes | Various |
| Speculative beam search | Combines speculative decoding with beam search | Various |
| Creative beam search | Uses an LLM-as-judge to score beams | Zhao et al. (2024) |
Constrained beam search is particularly noteworthy. In applications like data-to-text generation or terminology-constrained translation, the output must contain specific words or phrases. Constrained beam search modifies the pruning step to ensure that beams satisfying more constraints are preferentially retained, even if their raw scores are lower.
Beam search is implemented in most major deep learning and NLP frameworks.
Hugging Face Transformers provides beam search through the generate() method with num_beams parameter. It supports length penalty, repetition penalty, no-repeat n-gram size, diverse beam search (via num_beam_groups), and constrained generation.
OpenNMT includes beam search with configurable length normalization, coverage penalty, and stepwise penalty functions, closely following the formulations of Wu et al. (2016).
Fairseq (Meta's sequence modeling toolkit) implements beam search with various scoring modifications and has been used extensively in machine translation research.
CTC beam search decoders for speech recognition are available in libraries such as pyctcdecode and NVIDIA's NeMo toolkit, with optional language model fusion.
As of early 2026, beam search occupies a well-defined niche in the broader landscape of decoding methods. For tasks where output fidelity and determinism are paramount (machine translation, speech recognition, structured data generation, code generation with syntactic constraints), beam search remains the method of choice. Production translation systems continue to use it, and ASR systems rely on it for accurate transcription.
For open-ended language generation, which now represents the majority of LLM use cases, sampling methods dominate. The trend toward ever-larger models has also reduced the relative benefit of beam search, since better models require less search to produce good outputs.
Recent research directions include combining beam search with reranking (generating a diverse set of candidates via sampling, then using beam search or a separate model to select the best one), using beam search within speculative decoding frameworks for faster inference, and applying constrained beam search to ensure outputs satisfy format requirements in structured generation tasks.
Minimum Bayes Risk (MBR) decoding has emerged as a compelling alternative that borrows ideas from both beam search and sampling. MBR generates multiple candidate translations (typically via sampling), then selects the candidate that maximizes expected utility across all candidates. This approach has been shown to outperform beam search on several machine translation benchmarks, and some researchers view it as the natural successor to beam search in translation [10].