Beam search is a heuristic search algorithm used in sequence generation tasks that explores multiple candidate sequences in parallel, retaining only the top-scoring hypotheses at each decoding step. Originally developed in classical artificial intelligence for problems like speech recognition and combinatorial planning, beam search became a cornerstone decoding method in natural language processing, machine translation, speech recognition, and image captioning throughout the 2010s and into the 2020s. While sampling-based methods have largely supplanted beam search for open-ended text generation in modern large language models, beam search remains widely used in tasks that demand high-fidelity, deterministic output and is still the default decoder in production neural machine translation and automatic speech recognition systems.
Beam search predates modern neural networks by several decades. The algorithm was introduced by Bruce Lowerre in his 1976 doctoral dissertation at Carnegie Mellon University, where it was used in the Harpy speech recognition system. Lowerre originally called the technique the "locus model of search," but the name "beam search" was already in common use by 1977 and stuck. Harpy was funded by the U.S. Defense Advanced Research Projects Agency (DARPA) and aimed to recognize connected speech with a 1,000-word vocabulary, a then-formidable task. Beam search made the search tractable by pruning unpromising hypotheses at each time step rather than expanding the full graph [11].
In classical AI, beam search is treated as a memory-bounded variant of best-first search. Where best-first search keeps all generated nodes in an open list ordered by a heuristic, beam search retains only the top K nodes at each level and permanently discards the rest. This tradeoff reduces memory from exponential to linear in the search depth, but sacrifices completeness: the optimal solution can be pruned away if it does not appear in the top K at some intermediate step. Several variants restore completeness or anytime behavior, including beam-stack search by Zhou and Hansen (2005), which combines beam search with systematic backtracking to guarantee optimality given enough time, and breadth-first heuristic search [12].
Outside of speech recognition, early beam search applications included combinatorial scheduling, planning tasks in DARPA-funded competitions, and protein structure prediction. The algorithm crossed into neural network decoding through statistical machine translation systems in the 2000s, where phrase-based decoders used beam search over partial translation hypotheses. When neural sequence-to-sequence models took over machine translation around 2014, beam search came along as the natural decoding strategy and has remained dominant in that field ever since.
At its heart, beam search is a breadth-limited tree search. Given a model that produces probability distributions over possible next tokens at each step, beam search maintains a fixed number of partially completed sequences (called "beams" or "hypotheses") and extends them one token at a time. At every step, the algorithm considers all possible one-token extensions of every active beam, scores them, and retains only the top-K candidates, where K is known as the beam width. This process repeats until all beams reach an end-of-sequence token or a maximum length.
Beam search can be understood as a middle ground between two extremes. On one side, exhaustive search would evaluate every possible output sequence, guaranteeing the globally optimal result but at a computational cost that grows exponentially with sequence length. On the other side, greedy decoding picks the single most probable token at each step, which is fast but frequently leads to suboptimal sequences because a locally high-probability token can steer the decoder toward a poor overall output. Beam search balances these tradeoffs by exploring multiple paths simultaneously without attempting to enumerate all of them [1].
Formally, given an autoregressive model defining a conditional distribution P(y_t | y_<t, x) over the vocabulary V at each step t, beam search approximates the maximum a posteriori (MAP) sequence:
y* = argmax_y log P(y | x) = argmax_y sum_{t=1}^{T} log P(y_t | y_<t, x)
Finding the exact argmax is intractable because the search space contains |V|^T possible sequences. Beam search instead returns a high-probability (though not necessarily optimal) sequence, performing O(K * V) candidate-scoring work per decoding step and O(K * V * T) in total, where K is the beam width, V is the vocabulary size, and T is the maximum decoding length.
The beam search algorithm proceeds through a series of clearly defined stages.
The algorithm starts with a single beam containing only the start-of-sequence token (often denoted <sos> or <bos>). The cumulative log-probability of this initial beam is set to zero.
At each time step t, the algorithm takes every active beam and computes the model's probability distribution over the full vocabulary for the next token. If the beam width is K and the vocabulary size is V, this produces up to K * V candidate extensions. Each candidate's score is the cumulative log-probability of the parent beam plus the log-probability of the new token. Working in log-probability space turns the product of probabilities into a sum, which avoids numerical underflow and lets the algorithm rank candidates with simple addition.
From all K * V candidates, the algorithm selects the top K sequences by score. All other candidates are discarded. This pruning step is what keeps the computation tractable: regardless of vocabulary size, only K beams survive each round. In practice, implementations use a top-K selection algorithm (such as a partial sort or a heap) on the K * V candidate scores. For modern transformer models with vocabulary sizes around 50,000 to 200,000 and beam widths in the single digits, this top-K step is fast compared to the model forward pass.
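As a minimal illustration, the pruning step can be written as a single batched operation. The sketch below assumes PyTorch; prune_to_top_k and its argument names are hypothetical:

```python
import torch

def prune_to_top_k(beam_scores, next_token_log_probs, K):
    """beam_scores: [K] cumulative log-probs of the parent beams.
    next_token_log_probs: [K, V] per-beam log-probs for the next token."""
    V = next_token_log_probs.shape[-1]
    # Score every one-token extension: parent score plus token log-prob.
    candidate_scores = beam_scores.unsqueeze(1) + next_token_log_probs  # [K, V]
    # Flatten and take the global top K (partial selection, not a full sort).
    top_scores, flat_ids = candidate_scores.view(-1).topk(K)
    parent_ids = flat_ids // V  # which beam each survivor extends
    token_ids = flat_ids % V    # the token that extends it
    return top_scores, parent_ids, token_ids
```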
When a beam generates an end-of-sequence token (e.g., <eos>), that beam is considered complete and is moved to a finished list. The search continues for the remaining active beams. The algorithm terminates when all K beams have finished, when a maximum decoding length is reached, or when no active beams remain. From the finished list, the highest-scoring complete sequence is returned as the output. Many implementations also support an early_stopping flag that halts the search as soon as the top K finished hypotheses are guaranteed to score higher than any continuation of the remaining active beams.
Consider a simple case with beam width K = 2 and a vocabulary of {A, B, C, <eos>}. At step 1, starting from <sos>, suppose the model assigns the highest probabilities to token A (log-prob -0.3) and token B (log-prob -0.5). These become the two active beams.
At step 2, the algorithm expands both beams. Beam [A] might generate candidates [A, A] with score -0.7, [A, B] with score -0.9, and [A, C] with score -1.2. Beam [B] might generate [B, A] with score -0.8, [B, C] with score -1.0, and [B, B] with score -1.5. From all six candidates, the top 2 are kept: [A, A] at -0.7 and [B, A] at -0.8.
At step 3, suppose [A, A] expands to [A, A, <eos>] with score -0.9 and [A, A, C] with score -1.1, while [B, A] expands to [B, A, <eos>] with score -1.0 and [B, A, A] with score -1.2. The top 2 are now [A, A, <eos>] (which moves to the finished list) and [A, A, C] (which keeps growing). The process continues until every beam terminates or the maximum length is reached. The final output is the highest-scoring complete sequence from the finished list, possibly after applying length normalization.
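The step-2 pruning in this walkthrough takes only a few lines to reproduce (the scores are the made-up values from the example):

```python
# Six candidate extensions and their cumulative log-probabilities.
candidates = {
    ("A", "A"): -0.7, ("A", "B"): -0.9, ("A", "C"): -1.2,
    ("B", "A"): -0.8, ("B", "C"): -1.0, ("B", "B"): -1.5,
}
K = 2
beams = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:K]
print(beams)  # [(('A', 'A'), -0.7), (('B', 'A'), -0.8)]
```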
Beam width (also called beam size, often written K, k, or B) is the single most important hyperparameter in beam search. It controls the tradeoff between search thoroughness and computational cost.
| beam width | behavior | typical use |
|---|---|---|
| K = 1 | Equivalent to greedy decoding; only one hypothesis is maintained | Fast prototyping, latency-critical applications, modern LLM chat |
| K = 4-5 | Standard setting for machine translation benchmarks | Research benchmarks, production translation systems |
| K = 5-10 | Common for summarization and image captioning | News summarization with BART, T5 |
| K = 10-20 | Broader search; diminishing returns in many tasks | Speech recognition, some translation tasks |
| K = 50-100 | Very broad search; high computational cost | Specialized ASR with language model rescoring, reranking experiments |
| K = 200+ | Rarely used; often leads to degraded output quality | Research on beam search pathologies and exact decoding studies |
Increasing beam width generally improves output quality up to a point, but the relationship is not monotonic. Several studies have documented the "beam search curse" or "beam search degradation," where very large beam widths actually produce worse translations than moderate ones [2]. Cohen and Beck (2019) showed empirically that wider beams increasingly surface candidates that begin with a very low-probability token followed by a run of tokens with high conditional probability, and that such sequences tend to receive lower evaluation scores than high-probability sequences without this pattern. Stahlberg and Byrne (2019) pushed this further, showing that for more than 50% of sentences in a standard NMT model, the global best score under the model is assigned to the empty translation, meaning that vanilla NMT in its current form requires just the right amount of beam search errors to produce useful output [13].
Computational cost scales linearly with beam width. Each additional beam requires a full forward pass through the model at every time step, so doubling the beam width approximately doubles the decoding time. Memory usage also scales linearly, as the algorithm must store the hidden states (key-value cache for transformer models) and partial sequences for all active beams. On modern accelerators, beam search is typically batched: the K beams form a single batch dimension and run through the model in parallel, so the wall-clock cost of K = 4 is closer to a small constant overhead over greedy than to 4x slowdown, especially when batch size and sequence length are the dominant factors in latency.
Greedy decoding is the simplest possible decoding strategy: at each step, select the single token with the highest probability and continue. It is equivalent to beam search with beam width 1.
| aspect | greedy decoding | beam search |
|---|---|---|
| Speed | Fastest (one forward pass per step) | K times slower than greedy in the worst case |
| Output quality | Frequently suboptimal | Generally better for structured tasks |
| Determinism | Deterministic | Deterministic |
| Diversity | No diversity; single output | Low diversity across beams |
| Risk of local optima | High | Lower (explores K paths) |
| Memory | Minimal | K times the activations and KV cache of greedy |
The fundamental limitation of greedy decoding is that it can miss high-quality sequences that begin with a low-probability token. For example, in translation, the best overall rendering of a sentence might start with a word that is only the third most probable at position one but leads to a much better continuation. Greedy decoding would never discover this path, while beam search with K >= 3 would [3].
That said, greedy decoding has seen a surprising resurgence in certain settings. Finbarr Timbers and others have observed that when large language models are sufficiently well-trained and the temperature is set appropriately, greedy decoding can produce outputs of comparable quality to beam search for many tasks, particularly when the model is confident in its predictions [4]. The distribution sharpening that comes from larger pretraining corpora and instruction tuning has reduced the relative gap between greedy and beam, which is part of why modern LLM serving stacks like vLLM and Text Generation Inference often default to greedy or sampling rather than beam.
Sampling-based decoding methods introduce controlled randomness into the generation process, producing diverse and often more natural-sounding text. The main sampling approaches include:
Temperature sampling scales the logits before applying softmax, with higher temperatures flattening the distribution (more randomness) and lower temperatures sharpening it (approaching greedy behavior). At temperature 0, sampling becomes deterministic and is identical to greedy.
Top-k sampling restricts the sampling pool to the k most probable tokens at each step, zeroing out probabilities for all other tokens. This prevents sampling from the long tail of unlikely tokens.
Top-p sampling, also known as nucleus sampling and introduced by Holtzman et al. in 2019, dynamically selects the smallest set of tokens whose cumulative probability exceeds a threshold p. This adapts the candidate pool size to the model's confidence at each step.
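The three techniques compose naturally in a single sampling step. A minimal sketch follows, assuming a PyTorch logits vector; the function name and defaults are illustrative:

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0):
    # Temperature: rescale logits before softmax (lower T sharpens, approaching greedy).
    probs = torch.softmax(logits / max(temperature, 1e-8), dim=-1)
    if top_k > 0:
        # Top-k: zero out everything outside the k most probable tokens.
        kth = probs.topk(top_k).values[-1]
        probs = torch.where(probs >= kth, probs, torch.zeros_like(probs))
    if top_p < 1.0:
        # Nucleus: keep the smallest set whose cumulative probability reaches p.
        sorted_probs, sorted_ids = probs.sort(descending=True)
        cutoff = int((sorted_probs.cumsum(dim=-1) < top_p).sum().item()) + 1
        mask = torch.zeros_like(probs)
        mask[sorted_ids[:cutoff]] = 1.0
        probs = probs * mask
    probs = probs / probs.sum()  # renormalize after truncation
    return torch.multinomial(probs, num_samples=1).item()
```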
The table below compares the most common decoding strategies in modern LLM inference.
| method | deterministic | diversity | repetition | typical use |
|---|---|---|---|---|
| Greedy (K=1) | Yes | None | High | Fast inference, structured tasks, low-latency chat |
| Beam search (K=4-10) | Yes | Low | Medium | NMT, summarization, ASR, code generation with constraints |
| Diverse beam search | Yes (given groups) | Medium | Lower than beam | N-best lists, image captioning, reranking |
| Temperature sampling | No | Tunable | Lower with higher T | Open-ended generation, creative writing |
| Top-k sampling | No | Bounded | Low | Story generation, dialogue |
| Top-p / nucleus | No | Adaptive | Low | Modern LLM chat (most common) |
| Min-p sampling | No | Adaptive | Low | Newer LLM serving stacks (2024+) |
A key insight from Holtzman et al. (2019) is that human language does not follow the distribution of highest-probability next words. People regularly produce text that would not rank as the top prediction of a language model, and this mild unpredictability is what makes human writing engaging rather than bland. Beam search, by always selecting high-probability sequences, tends to produce text that is generic and repetitive, especially for open-ended generation tasks. Holtzman et al. showed that nucleus sampling produces text whose statistical properties (distribution of n-gram frequencies, entropy at each position, vocabulary diversity) match human writing far better than beam search, even though the beam search outputs have higher likelihood under the model [5].
A well-known problem with standard beam search is its bias toward shorter sequences. Because the score of a sequence is the sum of log-probabilities (all negative or zero), longer sequences accumulate more negative terms and therefore have lower raw scores. Without correction, beam search systematically prefers short outputs, and in extreme cases the empty translation receives the highest score [13].
Length normalization addresses this by dividing the sequence score by a function of its length. The simplest form divides the cumulative log-probability by the sequence length:
score(Y) = (1 / |Y|) * sum_t log P(y_t | y_<t, x)
This converts the score to an average log-probability per token, which removes the length bias entirely but can over-correct, leading the search to prefer overly long sequences. The Google Neural Machine Translation (GNMT) system by Wu et al. (2016) introduced a more flexible formulation:
score(Y) = log P(Y | X) / lp(Y)
where lp(Y) = ((5 + |Y|) / 6)^alpha is the length penalty term. The hyperparameter alpha controls the strength of the normalization: alpha = 0 applies no normalization, while alpha = 1 applies full normalization. Values around 0.6 to 0.8 are commonly used in practice; the GNMT paper itself reports alpha = 0.2 as optimal in some configurations and notes that the right value depends on the language pair and beam width [6].
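To make the effect concrete, a small sketch with made-up scores shows the penalty flipping a ranking that raw log-probabilities would get wrong:

```python
def gnmt_length_penalty(length, alpha=0.6):
    # lp(Y) = ((5 + |Y|) / 6) ** alpha from Wu et al. (2016)
    return ((5 + length) / 6) ** alpha

raw_short, raw_long = -4.0, -5.0  # cumulative log-probs (illustrative)
len_short, len_long = 5, 12
# Raw scores prefer the short hypothesis...
print(raw_short > raw_long)                        # True
# ...but after normalization the longer one ranks higher.
print(raw_short / gnmt_length_penalty(len_short))  # ≈ -2.94
print(raw_long / gnmt_length_penalty(len_long))    # ≈ -2.68
```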
Wu et al. (2016) also introduced a coverage penalty for machine translation, which encourages the decoder to attend to all parts of the source sentence. Defined as:
cp(X, Y) = beta * sum_i log min(sum_j p_{i,j}, 1.0)
where p_{i,j} is the attention weight from target position j to source position i, the coverage penalty sums the log of the minimum of the accumulated attention weights and 1.0 for each source position, penalizing hypotheses that leave parts of the input untranslated. The hyperparameter beta controls the strength of this penalty. Together with length normalization, the coverage penalty improved BLEU scores by roughly 1 point in the GNMT system, and beta values around 0.2 are reported as effective [6].
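A sketch of the computation, under the assumption that attn is a [source_len, target_len] array of attention weights for one hypothesis:

```python
import numpy as np

def coverage_penalty(attn, beta=0.2):
    # Accumulate attention mass per source position, capped at 1.0;
    # positions that received little attention contribute large negative log terms.
    coverage = np.minimum(attn.sum(axis=1), 1.0)
    return beta * np.log(coverage).sum()
```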
Several alternative length normalization schemes appear in the literature. The Hugging Face Transformers length_penalty parameter is implemented as an exponential penalty: scores are divided by length^length_penalty, with values greater than 1.0 favoring longer sequences and values less than 1.0 favoring shorter ones, with a default of 1.0 [16]. Murray and Chiang (2018) proposed correcting the bias with a per-word reward, learned from data and added to the score at each decoding step. Meister, Cotterell, and Vieira (2020) reframed length-related corrections as instances of a more general regularized decoding objective motivated by the cognitive science principle of uniform information density (UID), where speakers tend to spread information evenly across an utterance. Their analysis suggests that the length and coverage penalties popularized by GNMT happen to enforce something close to UID, which may explain why they work better than the raw model score [14].
Over the years, numerous variants of beam search have been proposed to address its limitations. The table below lists the most commonly cited variants.
| variant | key idea | reference |
|---|---|---|
| Standard beam search | Top-K hypotheses each step, ordered by sum of log-probabilities | Lowerre (1976) |
| Greedy decoding | Beam search with K = 1 | trivial special case |
| Length-normalized beam search | Divides score by a function of length | Wu et al. (2016) |
| Coverage-augmented beam search | Adds attention coverage penalty for machine translation | Wu et al. (2016) |
| Diverse beam search | Adds inter-group diversity penalty | Vijayakumar et al. (2016) |
| Group beam search | Special case of diverse beam search with disjoint groups | Vijayakumar et al. (2016) |
| Constrained beam search | Restricts outputs to satisfy lexical constraints | Anderson et al. (2017) |
| Trie-constrained beam search | Uses a trie to enforce dictionary or vocabulary constraints | various |
| Stochastic beam search | Uses Gumbel-top-k for probabilistic beam selection without replacement | Kool et al. (2019) |
| Best-first beam search | Uses a priority queue to return same beams faster (up to 10x) | Meister, Cotterell, Vieira (2020) |
| Beam-stack search | Adds backtracking to beam search for completeness | Zhou & Hansen (2005) |
| Iterative beam search | Refines beams through multiple passes with growing K | various |
| Speculative beam search | Combines speculative decoding with beam search | various |
| Topic-aware beam search | Conditions diversity on topic embeddings for dialogue and summarization | various 2018-2020 |
| Creative beam search | Uses an LLM-as-judge to score beams for creative writing | Zhao et al. (2024) |
| Nucleus beam search | Combines nucleus sampling with beam search structure | Shaham & Levy (2022) |
Standard beam search tends to produce a set of hypotheses that are nearly identical, often differing by only one or two tokens. This lack of diversity is problematic for applications that need a ranked list of meaningfully different options, such as N-best translation, image captioning candidate sets, and reranking pipelines.
Diverse Beam Search (DBS), proposed by Vijayakumar et al. (2016), addresses this by dividing the beam of size K into G groups of size K/G each, and adding a diversity penalty between groups. At each step, the algorithm decodes one group at a time. When scoring candidates for group g, a penalty term is applied based on how similar those candidates are to the outputs already selected in groups 1 through g-1. This encourages each group to explore different regions of the output space [7].
The diversity penalty can be based on Hamming distance, n-gram overlap, or other similarity measures. DBS has been shown to improve not only the diversity of the output list but also the quality of the top-1 result in some settings, because the diversity constraint implicitly regularizes the search and prevents it from converging prematurely on a narrow part of the search space. Vijayakumar et al. demonstrated improvements on image captioning, machine translation, conversation, and visual question generation, with minimal computational or memory overhead compared to standard beam search [7].
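A sketch of the group-wise penalty using a Hamming-style dissimilarity: tokens already chosen by earlier groups at the current step are down-weighted (diversity_strength is a hypothetical hyperparameter name):

```python
from collections import Counter

def diversity_adjusted_scores(token_log_probs, prior_group_tokens, diversity_strength=0.5):
    """token_log_probs: token -> log-prob for the current group's expansion.
    prior_group_tokens: tokens emitted at this step by groups 1..g-1."""
    counts = Counter(prior_group_tokens)
    return {tok: lp - diversity_strength * counts[tok]
            for tok, lp in token_log_probs.items()}
```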
Constrained beam search forces the decoder to satisfy hard constraints on the output, such as including specific words, satisfying a regular expression, or following a JSON schema. Anderson et al. (2017) introduced lexically constrained beam search for image captioning, where image tags detected by a convolutional neural network were treated as required tokens that must appear in the caption. The algorithm tracks the constraint satisfaction state of each beam and biases the search toward beams that satisfy more constraints, achieving state-of-the-art results on out-of-domain captioning on MSCOCO [15].
This idea has been extended in many directions. Trie-constrained beam search uses a prefix tree (trie) to enforce that outputs come from a fixed vocabulary, common in text-to-SQL and data-to-text generation. Grammar-constrained decoding compiles a context-free grammar into a finite state machine that masks invalid token continuations at each step, ensuring the output is syntactically valid JSON, SQL, or code. Modern LLM serving frameworks like vLLM, llama.cpp, and Outlines combine grammar masking with greedy or beam decoding to provide reliable structured output for tool use and function calling.
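The masking mechanic at the heart of these methods is simple; the hard part, tracking the constraint state to determine which tokens are currently legal, is elided in this sketch:

```python
import torch

def apply_constraint_mask(logits, allowed_token_ids):
    # Send every disallowed token to -inf so neither greedy nor beam
    # search can ever select it.
    mask = torch.full_like(logits, float("-inf"))
    mask[list(allowed_token_ids)] = 0.0
    return logits + mask
```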
Stochastic beam search, introduced by Kool, van Hoof, and Welling (2019), combines beam search structure with sampling. The method extends the Gumbel-Max trick (which samples from a categorical distribution by adding Gumbel noise to log-probabilities and taking the argmax) to factorized distributions over sequences, producing exact samples without replacement using a single beam search pass. Stochastic beam search produces diverse hypotheses with theoretical guarantees and has been used to construct low-variance estimators for expected sentence-level BLEU score and model entropy [17].
Meister, Cotterell, and Vieira (2020) showed that standard beam search can be reformulated as a priority queue search that returns identical results but requires fewer scoring function calls. Their best-first beam search algorithm provably returns the same set of results as standard beam search, in the minimum number of model evaluations needed for optimality (modulo beam size). When the scoring function is monotonic in sequence length, the algorithm prunes hypotheses that cannot be in the final set early, achieving up to 10x speedup in practice. The authors also propose effective monotonic approximations of nonmonotonic scoring functions like length normalization and mutual information decoding, plus a memory-reduced variant that runs in a fraction of the time [18].
Minimum Bayes Risk (MBR) decoding has emerged as a compelling alternative that borrows ideas from both beam search and sampling. MBR generates multiple candidate translations (typically via sampling or via beam search), then selects the candidate that maximizes expected utility across all candidates under some similarity metric like BLEU, chrF, or COMET. Eikema and Aziz (2022) proposed sampling-based approximations to MBR that decouple the cost of exploration from the cost of robust expected utility estimation, enabling much larger hypothesis spaces than beam search supports. MBR with neural metrics like BLEURT or COMET has been shown to outperform beam search on several machine translation benchmarks, and some researchers view it as the natural successor to beam search in translation [10].
The best-known criticism of beam search in the modern era is what Holtzman et al. (2019) called "neural text degeneration." When applied to open-ended generation tasks like story continuation or dialogue, beam search produces text that is bland, generic, and prone to repetition. The Holtzman paper showed striking examples where the same model, given the same prompt, produced engaging human-like text under nucleus sampling and degenerate repetitive output under beam search. The problem is not that the model is bad; it is that maximizing likelihood under the model selects sequences that are too statistically predictable to look like real human writing, which has more variation and surprise [5].
This observation reframed beam search's relationship to language modeling. For tasks with relatively low entropy outputs (translation, summarization, transcription), the highest-likelihood sequence under a good model is also a reasonable output, so beam search works well. For high-entropy tasks (story writing, dialogue, brainstorming), the highest-likelihood sequence is a poor proxy for what humans want, and beam search systematically picks the wrong region of the output distribution. The blandness problem is the practical reason that modern chat-tuned LLMs default to sampling rather than beam decoding.
Meister, Cotterell, and Vieira (2020) offered a complementary explanation in their paper "If beam search is the answer, what was the question?" They show that beam search implicitly enforces a uniform information density (UID) property on the output, a constraint motivated by psycholinguistic theory that says speakers tend to space surprising tokens evenly across an utterance. When MAP decoding is reframed as exact decoding with a UID-regularized objective, beam search behavior is exactly what falls out [14]. This explains why beam search works well for translation (where UID is a reasonable target) and poorly for free-form generation (where humans deliberately violate UID for emphasis).
Beam search has been the dominant decoding method in neural machine translation since the field's inception. The encoder-decoder architecture with attention, first popularized by Bahdanau et al. (2014) and the Transformer by Vaswani et al. (2017), produces translations one token at a time, and beam search provides the mechanism for finding high-quality output sequences.
Production translation systems including Google Translate, Microsoft Translator, DeepL, and Meta NLLB have historically relied on beam search with beam widths typically between 4 and 12. The deterministic nature of beam search is an advantage in translation, where users expect consistent output for the same input. Length normalization and coverage penalties are standard enhancements in this setting [6]. The Hugging Face Transformers library, the OpenNMT toolkit, and Meta's Fairseq all use beam search by default for their translation models, with K = 5 being the most common choice in published benchmarks.
In automatic speech recognition (ASR), beam search is used to decode the output of acoustic models. The decoder must find the most likely word sequence given the audio input, and beam search provides an efficient way to explore the space of possible transcriptions.
Classic ASR systems applied beam search within the Viterbi algorithm framework, limiting the number of active states at each time step. Modern end-to-end ASR models, such as those based on Connectionist Temporal Classification (CTC) or attention-based encoder-decoder architectures, use beam search at the token or word level. Beam widths in ASR tend to be larger than in text generation, often ranging from 10 to 100, because the acoustic signal introduces more ambiguity than text-conditioned generation [8]. Hannun et al. (2014) introduced CTC prefix beam search with shallow language model fusion, where an external n-gram or neural language model contributes a weighted score at each beam expansion. This approach reduced the word error rate from 35.8% to 14.1% on a benchmark task and underlies modern systems including Baidu's Deep Speech 2 and many open-source ASR pipelines [19].
Libraries like pyctcdecode (Kensho), NVIDIA's NeMo, and SpeechBrain all provide CTC beam search decoders with optional language model fusion. Word-level beam search variants like CTC Word Beam Search constrain outputs to a fixed vocabulary using a trie, which is useful for medical transcription, code dictation, and other domain-restricted ASR.
Beam search is widely used in image captioning models, where a visual encoder (typically a convolutional neural network or vision transformer) produces a representation of an image, and an autoregressive decoder generates a textual caption. Beam search helps produce grammatically correct and content-appropriate captions.
However, beam search in image captioning has been criticized for producing generic and repetitive captions. Because the model tends to assign high probability to safe, common descriptions (for example, "a dog sitting on the grass"), beam search reinforces this tendency. Recent work on meshed context-aware beam search (MCBS) addresses this by dynamically adjusting the attention mechanism at each decoding step to focus on different image regions, producing more diverse and contextually appropriate captions [9]. Constrained beam search is also commonly used in this domain to force the inclusion of detected objects in the caption [15].
In AI code generation, beam search has been used to produce syntactically valid programs by maintaining multiple candidate completions. The structured nature of programming languages makes beam search particularly useful, as invalid syntax can be detected and pruned early. Some code generation systems combine beam search with constraint decoding, where only syntactically valid token extensions are considered at each step. That said, modern code LLMs like Codex, Code Llama, and DeepSeek-Coder more commonly use temperature sampling combined with self-consistency or test-based filtering, because the diversity of sampled programs improves pass rates more than beam search does on benchmarks like HumanEval and MBPP.
For JSON mode, function calling, and tool use, beam search competes with greedy decoding under grammar constraints. Beam search with grammar masking can sometimes find better-scoring valid outputs than greedy with grammar masking, but the latency cost is hard to justify when the constraint already collapses the output space significantly. Most production LLM serving stacks default to greedy or low-temperature sampling for structured output, reserving beam search for cases where the structured output also needs to optimize a quality metric (for example, in code generation with test-based reranking).
Beam search is rarely used for dialogue and chat in modern LLMs because it suffers from the blandness problem in this exact setting. ChatGPT, Claude, Gemini, and most other major chat models use temperature sampling with top-p filtering by default. Earlier dialogue systems built on smaller seq2seq models (Google's Meena, Facebook's Blender) used beam search with diversity penalties or n-gram blocking, but modern systems have largely moved on.
The shift from beam search to sampling-based methods in modern large language models reflects several converging factors.
First, the primary use case for LLMs has shifted. Machine translation and summarization, where beam search excels, are no longer the dominant applications. Conversational AI, creative writing, brainstorming, and open-ended question answering require output that is varied, engaging, and natural-sounding. Beam search's tendency to produce the single most probable sequence works against these goals [5].
Second, modern LLMs are substantially larger and better trained than the models of the mid-2010s. With billions to trillions of parameters trained on trillions of tokens, these models have much sharper probability distributions over plausible continuations. The gap between greedy or beam search and sampling narrows as model quality improves, because the model's top predictions are already good.
Third, beam search is computationally expensive at inference time. For a model with hundreds of billions of parameters, each forward pass is costly, and maintaining K parallel beams multiplies that cost. Sampling with temperature, top-k, or top-p requires only a single forward pass per token per request, making it much more efficient for serving millions of users. The KV cache memory cost, which dominates large-model inference, also scales linearly with the number of beams.
Fourth, beam search is deterministic, which is a liability for conversational applications. Users expect different responses to the same prompt across sessions, and developers want to be able to control the randomness of outputs. Sampling provides this control natively through the temperature parameter.
Fifth, research has shown that maximizing the probability of the output sequence (which is what beam search does) is not the same as maximizing output quality as judged by humans. The "likelihood trap" leads beam search to produce outputs that are probable but not necessarily good, a problem that worsens with very large beam widths [2, 13].
As a result, the default decoding strategy for models like GPT-4, Claude, Gemini, and LLaMA is temperature-controlled sampling with top-p or top-k filtering, not beam search. Reasoning-focused models like OpenAI o1 and DeepSeek-R1 use sampling combined with parallel inference and majority voting (self-consistency), which has supplanted beam search even for tasks like math and code where beam search was once natural.
Researchers have identified several well-documented pathologies of beam search.
Cohen and Beck (2019) demonstrated that in neural machine translation, increasing beam width beyond a moderate value (around 5 to 10) often degrades translation quality as measured by BLEU score. This counterintuitive result occurs because the model's length bias, exposure bias, and other training artifacts become more pronounced as the search becomes more thorough. The "optimal" translation under the model's scoring function is not the translation that humans prefer [2].
Stahlberg and Byrne (2019) extended this finding by implementing exact inference for neural sequence models using depth-first search seeded by beam search. They showed that beam search fails to find the global model best in most cases even with K = 100, but more importantly that the global best is often not what users want: for more than half of the sentences in a standard NMT model, the global best score under the model is assigned to the empty translation. This means that vanilla NMT models are pathologically miscalibrated and that beam search's search errors are essential to its empirical success [13].
Beam search frequently produces repetitive text, especially in language modeling (as opposed to translation). Sequences that contain repeating n-grams can accumulate high scores because the model learns that repeating a phrase is often locally probable. Various n-gram blocking heuristics have been proposed: for example, preventing any n-gram from appearing more than once in the output. While effective, these are patches rather than principled solutions. Hugging Face Transformers exposes this as the no_repeat_ngram_size and repetition_penalty parameters, which work with beam search and sampling alike.
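A sketch of the blocking rule, in the spirit of no_repeat_ngram_size (the library's actual implementation differs in detail): before expanding a beam, ban any token that would complete an n-gram already present in its output.

```python
def banned_next_tokens(tokens, n):
    """Tokens that would repeat an existing n-gram if generated next."""
    if n == 0 or len(tokens) < n - 1:
        return set()
    prefix = tuple(tokens[len(tokens) - (n - 1):])  # last n-1 tokens
    banned = set()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned
```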
Without length normalization, beam search can produce very short or even empty outputs, because the <eos> token often has a relatively high probability early in decoding. The cumulative score of a short sequence can exceed that of a longer, better sequence simply because fewer negative log-probability terms have been accumulated. This is the same length bias that motivates the length penalty in GNMT and the sampling-based MBR alternatives that have emerged since.
Neural sequence models are usually trained with teacher forcing (the model sees ground-truth previous tokens during training) but at inference time must condition on its own predictions. This mismatch, called exposure bias, means that small errors compound: once a beam wanders off the training distribution, the model's probability estimates become unreliable. Beam search can amplify exposure bias relative to greedy because it explores more paths, some of which are off-distribution.
Beam search is implemented in most major deep learning and NLP frameworks.
Hugging Face Transformers provides beam search through the generate() method with the num_beams parameter. Setting num_beams=1 is equivalent to greedy decoding; values greater than 1 enable beam search. Related parameters include length_penalty (exponential length penalty, default 1.0), early_stopping (whether to stop when all beams hit <eos>), no_repeat_ngram_size (blocks repeated n-grams), num_beam_groups (enables diverse beam search), diversity_penalty (penalty between beam groups), and num_return_sequences (how many beams to return, must be less than or equal to num_beams) [16]. Hugging Face also supports constrained beam search via the constraints argument, which accepts disjunctive lexical constraints.
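A typical invocation is shown below; the checkpoint name is illustrative, while the parameters are the generate() arguments described above:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

inputs = tokenizer("Beam search explores several hypotheses at once.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=5,              # beam width K
    length_penalty=0.6,       # exponential length penalty
    early_stopping=True,      # stop once the top beams cannot be improved
    no_repeat_ngram_size=3,   # block repeated trigrams
    num_return_sequences=3,   # return the 3 best finished beams (<= num_beams)
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```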
OpenNMT includes beam search with configurable length normalization, coverage penalty, and stepwise penalty functions, closely following the formulations of Wu et al. (2016).
Fairseq (Meta's sequence modeling toolkit) implements beam search with various scoring modifications and has been used extensively in machine translation research, including for the No Language Left Behind (NLLB) and SeamlessM4T multilingual translation systems.
CTC beam search decoders for speech recognition are available in libraries such as pyctcdecode, NVIDIA's NeMo, SpeechBrain, and Mozilla's DeepSpeech, with optional language model fusion through shallow fusion.
For LLM serving, vLLM, Text Generation Inference (TGI), and TensorRT-LLM all support beam search but treat it as a niche feature. The default decoding modes target the high-throughput sampling case because that matches how most chat applications are built. Beam search support in these stacks tends to lag behind sampling features and is sometimes excluded from the most aggressive batching paths because the K-fold replication of KV cache complicates continuous batching.
A compact reference sketch in plain Python follows; BOS_ID, EOS_ID, and the model's next_token_log_probs interface are assumed placeholders:

```python
import heapq

BOS_ID, EOS_ID = 1, 2  # assumed special-token ids; real values come from the tokenizer

def topk_indices(log_probs, k):
    """Indices of the k largest entries (partial selection via a heap)."""
    return heapq.nlargest(k, range(len(log_probs)), key=lambda i: log_probs[i])

def beam_search(model, x, K, max_len, length_alpha=0.6):
    beams = [(0.0, [BOS_ID])]  # (cumulative log-prob, tokens)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == EOS_ID:
                finished.append((score, tokens))  # complete; stop expanding this beam
                continue
            log_probs = model.next_token_log_probs(x, tokens)  # shape [V]
            # Only the top K extensions of each beam can survive the global cut.
            for tid in topk_indices(log_probs, K):
                candidates.append((score + log_probs[tid], tokens + [tid]))
        if not candidates:  # every beam has finished
            break
        # Global pruning: keep the K best of the up-to-K*K candidates.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:K]
    finished.extend(beams)  # beams still active at max_len count as finished

    def length_normalized(score, tokens):
        lp = ((5 + len(tokens)) / 6) ** length_alpha  # GNMT length penalty
        return score / lp

    return max(finished, key=lambda c: length_normalized(*c))
```
This sketch leaves out batching across beams (essential in practice for GPU efficiency), KV-cache reuse across beam expansions (critical for transformer latency), and edge cases around early stopping and finished-beam scoring. Real implementations like Hugging Face's BeamSearchScorer handle these details and add support for diverse and constrained variants.
The nominal cost of beam search is K times the cost of greedy. In practice, the actual slowdown depends heavily on the deployment context.
On batched GPU inference for a single request, the K beams are stacked into a single batch dimension and the model runs once per step. If the batch is otherwise small, K = 4 might add only modest latency overhead because the GPU is underutilized at small batch sizes. As batch size grows past the GPU's optimal point, beam search costs more in proportion to K because the math becomes compute-bound.
For production LLM serving with continuous batching (vLLM, TGI), beam search is harder to integrate because the K-fold expansion of each request changes the dynamic batch composition. Each beam's KV cache must be tracked separately, and when beams diverge their KV caches must be copied or shared with copy-on-write semantics. This is one of several reasons beam search is rarely the default for LLM chat serving even though the underlying frameworks support it.
Memory cost scales linearly with K. For a transformer with L layers and hidden size d, the KV cache for a sequence of length T is approximately 2 * L * T * d * sizeof(dtype) bytes per beam. With K beams this becomes K times that, which can be the dominant cost for large models with long contexts. This is why beam search is more practical in shorter-context tasks like translation than in long-context tasks like document summarization or code completion in large repos.
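For concreteness, a back-of-the-envelope calculation with illustrative model dimensions:

```python
# KV cache per beam ≈ 2 * L * T * d * sizeof(dtype)
L, T, d = 32, 2048, 4096    # layers, context length, hidden size (illustrative)
bytes_per_elem = 2          # float16
per_beam = 2 * L * T * d * bytes_per_elem
print(per_beam / 2**30)     # 1.0 GiB per beam
print(4 * per_beam / 2**30) # 4.0 GiB for a K = 4 beam
```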
Finding the exact MAP sequence under an autoregressive model is NP-hard in the worst case, even though each per-step decision is cheap. Beam search is a polynomial-time approximation that gives no formal quality guarantee. Stahlberg and Byrne (2019) constructed a depth-first search procedure that finds the actual model argmax for moderate-length sentences, and showed that beam search rarely matches this exact argmax even with K = 100. But, as discussed earlier, the model argmax is often a degenerate output (frequently the empty translation), so the fact that beam search misses it is not a defect but a feature. This is the central paradox of NMT decoding: vanilla NMT models are sufficiently miscalibrated that beam search's search errors are essential to producing useful output, and "better" search algorithms can produce worse translations [13, 14].
The practical implication is that beam search is best understood not as an approximation to the model's argmax but as a heuristic that combines the model's local predictions with implicit priors (length, coverage, uniform information density) that compensate for the model's training-time miscalibration. Recent work on MBR decoding tries to bypass this paradox by selecting outputs based on expected utility under a different metric (BLEU, COMET), avoiding the model's score altogether [10].
As of 2026, beam search occupies a well-defined niche in the broader landscape of decoding methods. For tasks where output fidelity and determinism are paramount (machine translation, speech recognition, structured data generation, code generation with syntactic constraints), beam search remains the method of choice. Production translation systems continue to use it, ASR systems rely on it for accurate transcription, and grammar-constrained decoding for tool use often pairs greedy or beam search with grammar masking.
For open-ended language generation, which now represents the majority of LLM use cases, sampling methods dominate. The trend toward ever-larger models has also reduced the relative benefit of beam search, since better models require less search to produce good outputs. Reasoning models like o1, o3, and DeepSeek-R1 use parallel sampling and self-consistency rather than beam search even for math and code, where beam search was once standard.
Recent research directions include combining beam search with reranking (generating a diverse set of candidates via sampling, then using beam search or a separate model to select the best one), using beam search within speculative decoding frameworks for faster inference, applying constrained beam search to ensure outputs satisfy format requirements in structured generation tasks, and developing principled alternatives like MBR decoding. The Meister-Cotterell line of work on best-first beam search and uniform-information-density-regularized decoding has reframed beam search theoretically and produced practical speedups [14, 18]. The Eikema-Aziz and follow-up work on sampling-based MBR has produced an alternative that outperforms beam on neural translation metrics like COMET and BLEURT, and is increasingly the recommended decoder in research-grade NMT [10].
Beam search will likely remain the backbone of NMT and ASR for the foreseeable future, but its share of overall decoding compute is shrinking as LLM chat and reasoning workloads, which prefer sampling, dominate inference budgets.