Candidate generation is the first stage in a multi-stage recommendation system or information retrieval pipeline. Its purpose is to quickly narrow a large collection of items (often millions or billions) down to a smaller set of hundreds or thousands of potentially relevant candidates. These candidates are then passed to a subsequent ranking stage, which applies more computationally expensive models to score and order them for final presentation to the user. The candidate generation stage prioritizes recall and speed over precision, while the ranking stage does the opposite.
Candidate generation is used extensively in production systems at companies such as Google, Meta, Netflix, Amazon, Spotify, Pinterest, LinkedIn, and Airbnb. The concept also appears in natural language processing (for passage retrieval in question answering systems) and in search engines (for document retrieval).
Imagine you have a giant toy store with a million toys, and you want to find the ten best toys for your friend's birthday. You can't look at every single toy one by one because that would take forever. So first, you quickly walk through the store and grab a cart full of about 100 toys that seem like they might be good. You pick based on simple rules: your friend likes dinosaurs, so you grab dinosaur toys; your friend likes the color blue, so you grab blue toys. That's candidate generation. Then you sit down with those 100 toys and carefully compare them to pick the best ten. That second, slower step is the ranking stage. Candidate generation is the fast, broad sweep that makes the careful comparison possible.
Modern production recommendation systems are almost always built as multi-stage pipelines, sometimes called recommendation funnels. The reason is practical: applying a complex ranking model to every item in a catalog of hundreds of millions would be far too slow for real-time serving. Instead, the system is broken into stages that progressively refine the set of items.
A typical pipeline has three or four stages:
| Stage | Purpose | Typical input size | Typical output size | Speed priority |
|---|---|---|---|---|
| Candidate generation (retrieval) | Identify broadly relevant items from the full catalog | Millions to billions | Hundreds to low thousands | Very high |
| Pre-ranking (optional) | Apply a lightweight model to trim candidates further | Hundreds to thousands | Hundreds | High |
| Ranking (scoring) | Score and order candidates with a rich, feature-heavy model | Hundreds | Tens | Moderate |
| Re-ranking | Apply business rules, diversity constraints, and policy filters | Tens | Final slate (5 to 50) | Lower |
At the candidate generation stage, the system may run multiple retrieval sources in parallel. For example, YouTube's recommendation system generates candidates from several sources simultaneously, including a collaborative filtering source, a content-based source, and sources that retrieve trending or recently uploaded content. The union of all retrieved candidates is then passed to the ranking stage.
The candidate generation and ranking stages have fundamentally different design goals.
Candidate generation focuses on recall: it must ensure that relevant items are not missed, because any item that is not retrieved at this stage can never appear in the final recommendation. The models used here are intentionally simpler and faster. A typical approach uses embedding lookups followed by approximate nearest neighbor (ANN) search, which can return results in sub-millisecond time even over billion-scale indexes.
The ranking stage focuses on precision. It uses richer features, including cross-features between user and item (for example, "how many times has this user clicked on items from this category in the past week"), and applies more complex models such as deep neural networks with hundreds of features. This stage can afford to be slower because it only processes a few hundred items rather than the full catalog.
There are several families of techniques for generating candidates. In practice, production systems often combine multiple approaches and merge their outputs.
Content-based filtering recommends items that are similar to items a user has previously engaged with. The system builds a profile of the user's preferences based on item features (such as genre, author, keywords, or visual attributes) and then retrieves items with matching or similar features.
Strengths:
- Requires no data from other users, so candidates can be generated for a user in isolation.
- Handles new items well, since an item can be retrieved as soon as its features are known.
- Recommendations are interpretable: each candidate can be traced back to the features that matched the user profile.
Limitations:
- Depends on rich, well-maintained item metadata, which can be expensive to produce.
- Tends toward overspecialization: it retrieves more of what the user already likes and rarely surfaces serendipitous candidates.
- Cannot exploit quality signals that emerge from other users' behavior.
Collaborative filtering generates candidates by finding patterns in user-item interactions. The core idea is that users who agreed on items in the past are likely to agree on other items in the future.
There are two main variants:
User-based collaborative filtering identifies users with similar interaction histories and recommends items that those similar users engaged with. It computes cosine similarity or Pearson correlation between user vectors, identifies the K nearest neighbor users, and then suggests items that those neighbors rated highly.
Item-based collaborative filtering focuses on item relationships. It computes similarity between items based on which users interacted with them, and then recommends items similar to those the target user has already engaged with. Item-based approaches tend to be more stable over time because item-item similarities change less frequently than user-user similarities.
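To make the item-based variant concrete, here is a minimal sketch in Python over a toy interaction matrix; the data and function names are illustrative, not taken from any production system:

```python
import numpy as np

# Toy interaction matrix: rows = users, columns = items, 1 = interacted.
interactions = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)

# Item-item cosine similarity: compare items by the users who touched them.
item_vectors = interactions.T                       # [num_items, num_users]
norms = np.linalg.norm(item_vectors, axis=1, keepdims=True)
sim = (item_vectors @ item_vectors.T) / (norms * norms.T)

def item_based_candidates(user_id, k=2):
    """Score unseen items by their similarity to the user's history."""
    history = interactions[user_id]                 # items the user engaged with
    scores = sim @ history                          # aggregate similarity to history
    scores[history > 0] = -np.inf                   # exclude already-seen items
    return np.argsort(scores)[::-1][:k]

print(item_based_candidates(user_id=0))             # candidate item indices
```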
Strengths:
- Needs no item metadata; interaction data alone is sufficient.
- Can surface serendipitous candidates that share no obvious features with the user's history.
- Improves naturally as more interaction data accumulates.
Limitations:
- Suffers from the cold start problem: new users and new items have no interactions to learn from.
- Interaction matrices are extremely sparse, which makes similarity estimates noisy.
- Tends to over-recommend popular items, since they dominate the interaction data.
Matrix factorization (MF) decomposes the sparse user-item interaction matrix into two lower-dimensional matrices: one for user embeddings and one for item embeddings. The predicted interaction score for a user-item pair is computed as the dot product of their respective embedding vectors.
Formally, the feedback matrix A is approximated as:
A ≈ U × V^T
where U is the user embedding matrix and V is the item embedding matrix. The dot product of user embedding u_i and item embedding v_j predicts how likely user i is to interact with item j.
Two common optimization algorithms are used to learn the embeddings:
| Algorithm | Strengths | Limitations |
|---|---|---|
| Stochastic Gradient Descent (SGD) | Flexible, supports various loss functions, parallelizable | Slower convergence, difficulty handling unobserved entries |
| Weighted Alternating Least Squares (WALS) | Faster convergence, handles weighted objectives well, distributable | Limited to squared loss formulations |
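As a minimal illustration of the SGD variant, the sketch below factorizes a small synthetic feedback matrix over observed entries only; the data and hyperparameters are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_items, dim = 50, 80, 8

# Observed (user, item, rating) triples; synthetic data for illustration.
observed = [(rng.integers(num_users), rng.integers(num_items),
             rng.uniform(1, 5)) for _ in range(500)]

U = rng.normal(scale=0.1, size=(num_users, dim))    # user embeddings
V = rng.normal(scale=0.1, size=(num_items, dim))    # item embeddings
lr, reg = 0.02, 0.01

for epoch in range(20):
    for u, i, r in observed:
        err = r - U[u] @ V[i]                       # prediction error
        # SGD step on squared loss with L2 regularization.
        U[u] += lr * (err * V[i] - reg * U[u])
        V[i] += lr * (err * U[u] - reg * V[i])

# The predicted score for any user-item pair is the embedding dot product.
print(U[0] @ V[0])
```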
Matrix factorization gained widespread attention during the Netflix Prize competition in 2006-2009. Several advanced variants exist, including SVD++, which combines explicit ratings with implicit feedback signals, and Non-Negative Matrix Factorization (NMF), which constrains embeddings to non-negative values for improved interpretability.
Deep learning has become the dominant approach for candidate generation in large-scale systems. The most widely adopted architecture is the two-tower model (also called a dual encoder).
The two-tower model consists of two separate neural networks, called towers, that produce embedding vectors for queries (users) and candidates (items) independently. The relevance score between a query and a candidate is computed as the dot product (or cosine similarity) of their embedding vectors.
Query tower: Encodes user features such as interaction history, demographics, device type, and time context into a dense user embedding vector.
Item tower: Encodes item features such as item ID, category, text description, and image features into a dense item embedding vector.
The architecture has a specific property: user features and item features do not interact until the final dot product. This separation is intentional. It allows the item embeddings to be precomputed offline and stored in an ANN index. At serving time, only the query tower runs online to produce the user embedding, which is then used to query the precomputed index. This asymmetric design enables sub-millisecond retrieval from catalogs containing hundreds of millions of items.
The two-tower model is used at YouTube, Google, Meta, Airbnb, Expedia, Glassdoor, and many other companies. Google's TensorFlow Recommenders (TFRS) library provides a reference implementation.
Training considerations: Two-tower models are typically trained with a softmax cross-entropy loss over in-batch negatives or sampled negatives. Careful negative sampling is important because the model must learn to distinguish relevant items from irrelevant ones, and naive random sampling can lead to easy negatives that do not provide useful learning signals.
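A minimal sketch of this architecture and the in-batch-negatives loss in tf.keras, assuming integer user and item IDs; the dimensions and tower structure are illustrative, not the TFRS reference implementation:

```python
import tensorflow as tf

NUM_USERS, NUM_ITEMS, DIM = 10_000, 50_000, 64      # illustrative sizes

def make_tower(vocab_size, dim):
    # Embed an integer ID, then refine it with a small MLP.
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, dim),
        tf.keras.layers.Dense(dim, activation="relu"),
        tf.keras.layers.Dense(dim),
    ])

user_tower = make_tower(NUM_USERS, DIM)             # query tower (runs online)
item_tower = make_tower(NUM_ITEMS, DIM)             # item tower (precomputed offline)

def in_batch_softmax_loss(user_ids, item_ids):
    # Score every user against every item in the batch. The diagonal holds
    # the true (positive) pairs; every other column is an in-batch negative.
    u = tf.math.l2_normalize(user_tower(user_ids), axis=-1)
    v = tf.math.l2_normalize(item_tower(item_ids), axis=-1)
    logits = tf.matmul(u, v, transpose_b=True)      # [B, B] similarity matrix
    labels = tf.range(tf.shape(logits)[0])          # diagonal indices
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=labels, logits=logits))

# Example: loss over a batch of three (user, clicked item) pairs.
loss = in_batch_softmax_loss(tf.constant([1, 2, 3]), tf.constant([10, 20, 30]))
```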
One of the most influential candidate generation systems was described in the 2016 paper "Deep Neural Networks for YouTube Recommendations" by Paul Covington, Jay Adams, and Emre Sargin at Google. The paper introduced a deep neural network approach to candidate generation that treated the problem as an extreme multi-class classification task.
The model takes as input a user's watch history (represented as averaged video ID embeddings), search history, demographic features, and geographic features. It feeds these through several fully connected layers and outputs a probability distribution over the entire video corpus using a sampled softmax (since the full softmax over millions of videos would be computationally infeasible).
At serving time, the model produces a user embedding vector, which is used to retrieve the nearest video embeddings via ANN search. This paper demonstrated that deep learning could substantially outperform traditional methods for candidate generation at scale and has been cited over 5,000 times.
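Assuming a TensorFlow implementation, a sampled softmax of the kind the paper describes is typically wired up along these lines; the corpus size, dimensions, and variable names here are illustrative:

```python
import tensorflow as tf

num_videos, dim, num_sampled = 1_000_000, 256, 100  # illustrative sizes

# The output layer treats each video as one class of an extreme classifier.
class_weights = tf.Variable(tf.random.normal([num_videos, dim]) * 0.01)
class_biases = tf.Variable(tf.zeros([num_videos]))

def candidate_loss(user_vectors, watched_video_ids):
    # user_vectors: [batch, dim] output of the user network.
    # watched_video_ids: [batch, 1] int64 ground-truth next-watch labels.
    # Sampled softmax evaluates only num_sampled negative classes per step
    # instead of all num_videos, which is what makes training tractable.
    return tf.reduce_mean(tf.nn.sampled_softmax_loss(
        weights=class_weights,
        biases=class_biases,
        labels=watched_video_ids,
        inputs=user_vectors,
        num_sampled=num_sampled,
        num_classes=num_videos))
```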
Pinterest developed PinSage, a graph neural network (GNN) for candidate generation at web scale. PinSage uses random-walk-based graph convolutions to learn embeddings for pins (images) on a graph with over 3 billion nodes and 18 billion edges. The learned embeddings capture both content features and graph structure (which pins appear on the same boards).
PinSage generates candidates by performing nearest neighbor lookup in the learned embedding space. In experiments, PinSage outperformed the best baseline by 40% in recall and 22% in Mean Reciprocal Rank (MRR). The model uses importance-based neighborhood sampling and MapReduce-based inference to generate embeddings for billions of nodes within a few hours.
Once candidate generation models produce embeddings for users and items, the system needs to quickly find the items whose embeddings are closest to the query embedding. Exact nearest neighbor search takes time linear in the catalog size for every query (O(n)), which is too slow for large catalogs. Approximate nearest neighbor (ANN) algorithms trade a small amount of accuracy for large speedups, typically achieving sub-millisecond query times over billion-scale indexes.
ANN algorithms fall into four main categories:
| Category | Method | How it works | Trade-offs |
|---|---|---|---|
| Tree-based | KD-trees, Annoy | Recursively partition the embedding space into regions using hyperplane splits | Fast for low dimensions, degrades in high dimensions |
| Hash-based | Locality-Sensitive Hashing (LSH) | Hashes similar vectors into the same buckets with high probability, then searches only within matching buckets | Memory-efficient, but lower recall than graph methods |
| Graph-based | HNSW | Builds a multi-layer navigable small world graph where each node connects to nearby nodes; search navigates the graph from coarse to fine layers | State-of-the-art recall/speed trade-off, higher memory usage |
| Quantization-based | Product Quantization (PQ), ScaNN | Compresses vectors into compact codes to enable fast approximate distance computation | Memory-efficient, good for very large indexes |
FAISS (Facebook AI Similarity Search): An open-source library developed by Meta for efficient similarity search and clustering of dense vectors. FAISS supports multiple indexing strategies including IVF (Inverted File Index), HNSW, and Product Quantization. It provides GPU acceleration for faster indexing and querying and is widely used in production recommendation systems.
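A minimal FAISS sketch of the IVF pattern, assuming precomputed item embeddings; the data and parameter values are synthetic:

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim, num_items, nlist = 64, 100_000, 256
item_vecs = np.random.rand(num_items, dim).astype("float32")

# IVF index: k-means-partition the space into nlist cells; at query time,
# search only a few of the closest cells rather than the whole index.
quantizer = faiss.IndexFlatIP(dim)                  # inner-product metric
index = faiss.IndexIVFFlat(quantizer, dim, nlist,
                           faiss.METRIC_INNER_PRODUCT)
index.train(item_vecs)                              # learn the partition
index.add(item_vecs)                                # add item embeddings
index.nprobe = 8                                    # cells probed per query

query = np.random.rand(1, dim).astype("float32")    # online user embedding
scores, item_ids = index.search(query, k=100)       # top-100 candidates
```

Raising nprobe improves recall at the cost of latency, which is exactly the recall-latency trade-off discussed below.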
ScaNN (Scalable Nearest Neighbors): Developed by Google Research, ScaNN uses anisotropic vector quantization optimized for maximum inner product search (MIPS). It combines search space pruning via k-means partitioning with learned quantization. ScaNN powers Google's Vertex AI Vector Search and is available as open-source software.
HNSW (Hierarchical Navigable Small Worlds): An algorithm (not a library) that builds a multi-layer proximity graph. Each layer contains a subset of the data points, with higher layers being sparser. Search starts at the top layer and navigates downward, using greedy routing at each level. HNSW consistently achieves some of the best recall-speed trade-offs in benchmarks and is implemented in FAISS, Hnswlib, and many vector databases.
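A minimal sketch using the Hnswlib Python bindings, with illustrative parameter values:

```python
import numpy as np
import hnswlib  # pip install hnswlib

dim, num_items = 64, 100_000
item_vecs = np.random.rand(num_items, dim).astype("float32")

# Build the multi-layer graph; M controls per-node out-degree, and
# ef_construction controls build-time search breadth (higher values
# yield a better graph but a slower build).
index = hnswlib.Index(space="ip", dim=dim)          # inner-product space
index.init_index(max_elements=num_items, M=16, ef_construction=200)
index.add_items(item_vecs, np.arange(num_items))

index.set_ef(64)                                    # query-time search breadth
labels, distances = index.knn_query(item_vecs[:1], k=10)
```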
Annoy (Approximate Nearest Neighbors Oh Yeah): Developed by Spotify, Annoy uses random projection trees to partition the space. It is designed to be memory-efficient and supports memory-mapped files, making it suitable for use cases where the index must be shared across processes.
Candidate generation also plays a central role in information retrieval (IR), particularly in search engines and question answering (QA) systems. In IR, the task is often called "first-stage retrieval" or "document retrieval."
Traditional IR systems use sparse retrieval methods based on term-matching statistics. The most widely used algorithm is BM25 (Best Matching 25), which extends TF-IDF with document length normalization and term frequency saturation. BM25 scores documents based on how well they match the query terms, using an inverted index for fast lookup.
BM25 remains a strong baseline for candidate generation in search. Its strengths include fast retrieval via inverted indexes, good performance on exact keyword matching, effective handling of rare terms, and interpretable scoring. Its main limitation is that it cannot capture semantic similarity between terms that do not share surface forms.
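As a small illustration, the rank_bm25 Python package provides a BM25Okapi implementation; the corpus and whitespace tokenization below are deliberately naive:

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

corpus = [
    "how to change the oil in a car",
    "automobile repair and maintenance guide",
    "best hiking trails in the alps",
]
tokenized = [doc.split() for doc in corpus]         # naive whitespace tokens

bm25 = BM25Okapi(tokenized)
query = "car maintenance".split()

scores = bm25.get_scores(query)                     # one score per document
top_docs = bm25.get_top_n(query, corpus, n=2)       # candidate documents
print(scores, top_docs)
```

Note that the query matches only documents sharing its surface terms; "automobile repair" scores on "maintenance" alone, which illustrates the semantic gap discussed next.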
Dense retrieval models such as Dense Passage Retrieval (DPR) and Contriever use transformer-based encoders to map queries and documents into dense embedding vectors. Candidate documents are retrieved by finding the nearest embeddings to the query embedding, typically using ANN search.
Dense retrieval captures semantic similarity, so a query about "car maintenance" can match a document about "automobile repair" even though the terms differ. However, dense retrieval can struggle with rare entities and exact term matching, where BM25 excels.
Many modern systems combine sparse and dense retrieval for candidate generation. A common pattern is to run BM25 and a dense retrieval model in parallel, merge their candidate sets, and pass the combined set to a re-ranker. This hybrid approach captures both lexical and semantic matching signals. Hybrid retrieval is widely used in retrieval-augmented generation (RAG) pipelines for large language models.
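One common way to merge the candidate sets is reciprocal rank fusion (RRF), which needs only each retriever's ranks rather than comparable scores. A minimal sketch with hypothetical document IDs:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge best-first candidate lists from multiple retrievers.

    An item's fused score is the sum of 1 / (k + rank) over every list
    it appears in; k=60 is the constant from the original RRF paper and
    damps the influence of the very top ranks.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7"]           # hypothetical sparse results
dense_hits = ["d1", "d9", "d3"]          # hypothetical dense results
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))  # fused candidates
```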
Both content-based and collaborative filtering candidate generation methods rely on embedding spaces, where users and items are represented as vectors in a shared low-dimensional space. The choice of similarity measure affects which candidates are retrieved.
| Similarity measure | Formula | Properties |
|---|---|---|
| Cosine similarity | cos(u, v) = (u · v) / (‖u‖ ‖v‖) | Measures angle between vectors; insensitive to vector magnitude; commonly used when embedding norms are not meaningful |
| Dot product | u · v | Sensitive to both angle and magnitude; popular items with larger embedding norms can dominate results |
| Euclidean distance | ‖u − v‖₂ | Measures geometric distance; useful when magnitude carries meaning |
Google's ML recommendation course notes that "the dot product similarity is sensitive to the norm of the embedding," which can cause frequently occurring or popular items to overshadow more semantically relevant but less popular items. Some systems address this by normalizing embeddings before indexing.
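A small numeric example of this effect, using synthetic two-dimensional vectors:

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

user = np.array([1.0, 0.0])
niche = np.array([0.9, 0.1])        # well-aligned item, small norm
popular = np.array([3.0, 2.0])      # less-aligned item, large norm

# Dot product rewards the large-norm (popular) item...
print(user @ niche, user @ popular)                 # 0.9 vs 3.0
# ...while cosine similarity prefers the better-aligned one.
print(cosine(user, niche), cosine(user, popular))   # ~0.994 vs ~0.832
```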
Candidate generation models consume several types of features:
Dense features are continuous numerical values such as ratings, timestamps, prices, or popularity scores. These are fed directly into neural network layers.
Sparse features are categorical values such as item IDs, genres, user IDs, or geographic locations. These are high-cardinality and must be converted to dense representations through embedding layers. A vocabulary of item IDs may contain millions of entries, each mapped to an embedding vector of 10 to 100 dimensions.
Multivalent features represent variable-length sets. For example, a user's watch history is a set of video IDs of varying length. The standard approach is to look up the embedding for each ID and average them into a single fixed-width vector, which collapses the variable-length set into a fixed-size input and smooths over noise in individual interactions.
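A minimal sketch of this lookup-and-average step, with a randomly initialized table standing in for learned embeddings:

```python
import numpy as np

vocab_size, dim = 1_000, 16
embedding_table = np.random.randn(vocab_size, dim)  # learned in practice

def encode_watch_history(video_ids):
    """Collapse a variable-length ID list into one fixed-width vector."""
    if not video_ids:                   # empty history: fall back to zeros
        return np.zeros(dim)
    return embedding_table[video_ids].mean(axis=0)

short_history = encode_watch_history([42, 7])
long_history = encode_watch_history([42, 7, 301, 888, 15])
assert short_history.shape == long_history.shape == (dim,)
```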
The cold start problem arises when there is insufficient data to generate good candidates for new users or new items.
New user cold start: A user who has just joined the platform has no interaction history, so collaborative filtering and personalized embedding models have no signal to work with. Common solutions include:
- Falling back to popular or trending items.
- Using demographic, device, or geographic context as a coarse proxy for preferences.
- Asking new users to select interests explicitly during onboarding.
- Exploration strategies (for example, bandit-style algorithms) that gather preference signal quickly.
New item cold start: A newly added item has no interaction data, so collaborative filtering cannot surface it. Solutions include:
- Content-based retrieval using the item metadata (category, text, images) available at creation time.
- Dedicated exploration slots or boosts that expose the item to a sample of users to bootstrap interaction data.
- Models such as two-tower architectures that build item embeddings from content features rather than interaction history.
Content-based filtering is naturally more robust to the item cold start problem because it relies on item features rather than interaction history.
Candidate generation systems are evaluated differently from ranking systems because the goals differ. The primary concern is whether the relevant items are present in the retrieved set, not their exact ordering.
| Metric | What it measures | Why it matters for candidate generation |
|---|---|---|
| Recall@K | Fraction of relevant items that appear in the top K retrieved candidates | Directly measures whether the system retrieves the items that matter; the primary offline metric |
| Hit rate@K | Whether at least one relevant item appears in the top K | Useful when only one item needs to be retrieved correctly |
| Coverage | Fraction of the catalog that appears in recommendations across all users | Measures whether the system avoids over-concentrating on popular items |
| Latency (p50, p99) | Time to retrieve candidates for a single query | Candidate generation must complete in single-digit milliseconds to meet serving SLAs |
| Throughput (QPS) | Queries per second the system can handle | Determines infrastructure cost and scalability |
There is a fundamental trade-off between recall and latency. Increasing the number of ANN probes or the number of retrieved candidates improves recall but increases latency. Production systems tune these parameters to find the best operating point for their specific requirements.
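A minimal Recall@K implementation, with hypothetical item IDs:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items present in the top-k retrieved list."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)

# Example: 2 of 3 relevant items appear in the top 5 retrieved candidates.
print(recall_at_k(["a", "b", "c", "d", "e"], ["a", "c", "z"], k=5))  # ~0.667
```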
YouTube's recommendation system uses multiple candidate generation sources running in parallel. Each source retrieves a few hundred candidates using different signals (watch history, search history, trending videos, subscriptions). The union of candidates from all sources is sent to a deep ranking network. The system serves over a billion users and must generate candidates in real time.
Pinterest uses PinSage, a graph neural network trained on a bipartite graph of pins and boards. PinSage generates embeddings for over 3 billion pins and retrieves candidates through nearest neighbor search in the embedding space. The system also uses visual similarity and text-based retrieval as additional candidate sources.
Instagram's Explore recommendation system uses a multi-stage funnel. The candidate generation stage retrieves candidates from multiple sources, including account-based and media-based sources. These candidates pass through a three-stage ranking funnel: a distilled model returns the top 150, a lightweight neural network trims to the top 50, and a full deep neural network selects the final 25.
Google's search engine uses a multi-stage retrieval pipeline where BM25 and dense retrieval models generate candidate documents from a web-scale index. These candidates are then re-ranked by more sophisticated models such as BERT-based re-rankers. The candidate generation stage must handle trillions of documents with response times under 200 milliseconds.
| Approach | Speed | Cold start handling | Serendipity | Feature requirements | Typical scale |
|---|---|---|---|---|---|
| Content-based filtering | Fast | Good (uses item features) | Low | High (needs item metadata) | Small to medium |
| User-based collaborative filtering | Moderate | Poor (needs user history) | Moderate | Low | Small to medium |
| Item-based collaborative filtering | Moderate | Poor (needs item history) | Moderate | Low | Medium |
| Matrix factorization | Fast (after training) | Poor | Moderate | Low | Medium to large |
| Two-tower deep model | Very fast (at serving) | Moderate (can use content features) | Moderate | Flexible | Very large (billions) |
| Graph neural networks | Fast (after training) | Moderate | High | Needs graph structure | Very large (billions) |
| BM25 (sparse retrieval) | Very fast | Good (uses text features) | Low | Text required | Very large (trillions) |
| Dense retrieval | Fast | Moderate | Moderate | Text or features required | Large |
Use multiple retrieval sources. Combine collaborative, content-based, and popularity-based sources to improve recall and diversity. Most production systems run several candidate generators in parallel and merge their outputs.
Optimize for recall, not precision. At the candidate generation stage, missing a relevant item is worse than including an irrelevant one. The ranking stage will filter out irrelevant candidates.
Monitor catalog coverage. If the candidate generation stage consistently retrieves only popular items, the system will develop a popularity bias. Track the fraction of the catalog that appears in recommendations.
Tune ANN parameters carefully. The number of probes, the number of clusters, and the number of retrieved neighbors all affect the recall-latency trade-off. Use offline benchmarks to find the right operating point.
Handle cold start explicitly. Have fallback strategies for new users and new items. Blending content-based and popularity-based signals helps when interaction data is sparse.
Keep serving latency low. Candidate generation should complete in single-digit milliseconds for real-time applications. Precompute item embeddings, use efficient ANN indexes, and batch requests where possible.
Update embeddings regularly. User preferences and item relevance change over time. Re-train models and rebuild ANN indexes on a regular schedule (daily or more frequently for rapidly changing catalogs).