Candidate generation is the first stage in a multi-stage recommendation system or information retrieval pipeline. Its purpose is to quickly narrow a large collection of items (often millions or billions) down to a smaller set of hundreds or thousands of potentially relevant candidates. These candidates are then passed to a subsequent ranking stage, which applies more computationally expensive models to score and order them for final presentation to the user. The candidate generation stage prioritizes recall and speed over precision, while the ranking stage does the opposite.
Candidate generation is used extensively in production systems at companies such as Google, Meta, Netflix, Amazon, Spotify, Pinterest, LinkedIn, and Airbnb. The concept also appears in natural language processing (for passage retrieval in question answering systems) and in search engines (for document retrieval).
Imagine you have a giant toy store with a million toys, and you want to find the ten best toys for your friend's birthday. You can't look at every single toy one by one because that would take forever. So first, you quickly walk through the store and grab a cart full of about 100 toys that seem like they might be good. You pick based on simple rules: your friend likes dinosaurs, so you grab dinosaur toys; your friend likes the color blue, so you grab blue toys. That's candidate generation. Then you sit down with those 100 toys and carefully compare them to pick the best ten. That second, slower step is the ranking stage. Candidate generation is the fast, broad sweep that makes the careful comparison possible.
Modern production recommendation systems are almost always built as multi-stage pipelines, sometimes called recommendation funnels. The reason is practical: applying a complex ranking model to every item in a catalog of hundreds of millions would be far too slow for real-time serving. Instead, the system is broken into stages that progressively refine the set of items.
A typical pipeline has three or four stages:
| Stage | Purpose | Typical input size | Typical output size | Speed priority |
|---|---|---|---|---|
| Candidate generation (retrieval) | Identify broadly relevant items from the full catalog | Millions to billions | Hundreds to low thousands | Very high |
| Pre-ranking (optional) | Apply a lightweight model to trim candidates further | Hundreds to thousands | Hundreds | High |
| Ranking (scoring) | Score and order candidates with a rich, feature-heavy model | Hundreds | Tens | Moderate |
| Re-ranking | Apply business rules, diversity constraints, and policy filters | Tens | Final slate (5 to 50) | Lower |
At the candidate generation stage, the system may run multiple retrieval sources in parallel. For example, YouTube's recommendation system generates candidates from several sources simultaneously, including a collaborative filtering source, a content-based source, and sources that retrieve trending or recently uploaded content. The union of all retrieved candidates is then passed to the ranking stage.
The candidate generation and ranking stages have fundamentally different design goals.
Candidate generation focuses on recall: it must ensure that relevant items are not missed, because any item that is not retrieved at this stage can never appear in the final recommendation. The models used here are intentionally simpler and faster. A typical approach uses embedding lookups followed by approximate nearest neighbor (ANN) search, which can return results in sub-millisecond time even over billion-scale indexes.
The ranking stage focuses on precision. It uses richer features, including cross-features between user and item (for example, "how many times has this user clicked on items from this category in the past week"), and applies more complex models such as deep neural networks with hundreds of features. This stage can afford to be slower because it only processes a few hundred items rather than the full catalog.
There are several families of techniques for generating candidates. In practice, production systems often combine multiple approaches and merge their outputs.
Content-based filtering recommends items that are similar to items a user has previously engaged with. The system builds a profile of the user's preferences based on item features (such as genre, author, keywords, or visual attributes) and then retrieves items with matching or similar features.
Strengths:
- Requires no data from other users, so candidates can be generated for a user in isolation.
- Handles new items well, since an item can be retrieved as soon as its features are known.
- Recommendations are interpretable: each candidate can be traced back to the features that matched the user profile.
Limitations:
- Depends on rich, well-maintained item metadata, which can be expensive to produce.
- Tends toward overspecialization: it retrieves more of what the user already likes and rarely surfaces serendipitous candidates.
- Cannot exploit quality signals that emerge from other users' behavior.
Collaborative filtering generates candidates by finding patterns in user-item interactions. The core idea is that users who agreed on items in the past are likely to agree on other items in the future.
There are two main variants:
User-based collaborative filtering identifies users with similar interaction histories and recommends items that those similar users engaged with. It computes cosine similarity or Pearson correlation between user vectors, identifies the K nearest neighbor users, and then suggests items that those neighbors rated highly.
Item-based collaborative filtering focuses on item relationships. It computes similarity between items based on which users interacted with them, and then recommends items similar to those the target user has already engaged with. Item-based approaches tend to be more stable over time because item-item similarities change less frequently than user-user similarities.
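To make the item-based variant concrete, here is a minimal sketch in Python over a toy interaction matrix; the data and function names are illustrative, not taken from any production system:

```python
import numpy as np

# Toy interaction matrix: rows = users, columns = items, 1 = interacted.
interactions = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)

# Item-item cosine similarity: compare items by the users who touched them.
item_vectors = interactions.T                       # [num_items, num_users]
norms = np.linalg.norm(item_vectors, axis=1, keepdims=True)
sim = (item_vectors @ item_vectors.T) / (norms * norms.T)

def item_based_candidates(user_id, k=2):
    """Score unseen items by their similarity to the user's history."""
    history = interactions[user_id]                 # items the user engaged with
    scores = sim @ history                          # aggregate similarity to history
    scores[history > 0] = -np.inf                   # exclude already-seen items
    return np.argsort(scores)[::-1][:k]

print(item_based_candidates(user_id=0))             # candidate item indices
```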
Strengths:
- Needs no item metadata; interaction data alone is sufficient.
- Can surface serendipitous candidates that share no obvious features with the user's history.
- Improves naturally as more interaction data accumulates.
Limitations:
- Suffers from the cold start problem: new users and new items have no interactions to learn from.
- Interaction matrices are extremely sparse, which makes similarity estimates noisy.
- Tends to over-recommend popular items, since they dominate the interaction data.
Matrix factorization (MF) decomposes the sparse user-item interaction matrix into two lower-dimensional matrices: one for user embeddings and one for item embeddings. The predicted interaction score for a user-item pair is computed as the dot product of their respective embedding vectors.
Formally, the feedback matrix A is approximated as:
A ≈ U × V^T
where U is the user embedding matrix and V is the item embedding matrix. The dot product of user embedding u_i and item embedding v_j predicts how likely user i is to interact with item j.
Two common optimization algorithms are used to learn the embeddings:
| Algorithm | Strengths | Limitations |
|---|---|---|
| Stochastic Gradient Descent (SGD) | Flexible, supports various loss functions, parallelizable | Slower convergence, difficulty handling unobserved entries |
| Weighted Alternating Least Squares (WALS) | Faster convergence, handles weighted objectives well, distributable | Limited to squared loss formulations |
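As a minimal illustration of the SGD variant, the sketch below factorizes a small synthetic feedback matrix over observed entries only; the data and hyperparameters are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_items, dim = 50, 80, 8

# Observed (user, item, rating) triples; synthetic data for illustration.
observed = [(rng.integers(num_users), rng.integers(num_items),
             rng.uniform(1, 5)) for _ in range(500)]

U = rng.normal(scale=0.1, size=(num_users, dim))    # user embeddings
V = rng.normal(scale=0.1, size=(num_items, dim))    # item embeddings
lr, reg = 0.02, 0.01

for epoch in range(20):
    for u, i, r in observed:
        err = r - U[u] @ V[i]                       # prediction error
        # SGD step on squared loss with L2 regularization.
        U[u] += lr * (err * V[i] - reg * U[u])
        V[i] += lr * (err * U[u] - reg * V[i])

# The predicted score for any user-item pair is the embedding dot product.
print(U[0] @ V[0])
```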
Matrix factorization gained widespread attention during the Netflix Prize competition in 2006-2009. Several advanced variants exist, including SVD++, which combines explicit ratings with implicit feedback signals, and Non-Negative Matrix Factorization (NMF), which constrains embeddings to non-negative values for improved interpretability.
Deep learning has become the dominant approach for candidate generation in large-scale systems. The most widely adopted architecture is the two-tower model (also called a dual encoder).
The two-tower model consists of two separate neural networks, called towers, that produce embedding vectors for queries (users) and candidates (items) independently. The relevance score between a query and a candidate is computed as the dot product (or cosine similarity) of their embedding vectors.
Query tower: Encodes user features such as interaction history, demographics, device type, and time context into a dense user embedding vector.
Item tower: Encodes item features such as item ID, category, text description, and image features into a dense item embedding vector.
The architecture has a specific property: user features and item features do not interact until the final dot product. This separation is intentional. It allows the item embeddings to be precomputed offline and stored in an ANN index. At serving time, only the query tower runs online to produce the user embedding, which is then used to query the precomputed index. This asymmetric design enables sub-millisecond retrieval from catalogs containing hundreds of millions of items.
The two-tower model is used at YouTube, Google, Meta, Airbnb, Expedia, Glassdoor, and many other companies. Google's TensorFlow Recommenders (TFRS) library provides a reference implementation.
Training considerations: Two-tower models are typically trained with a softmax cross-entropy loss over in-batch negatives or sampled negatives. Careful negative sampling is important because the model must learn to distinguish relevant items from irrelevant ones, and naive random sampling can lead to easy negatives that do not provide useful learning signals.
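A minimal sketch of this architecture and the in-batch-negatives loss in tf.keras, assuming integer user and item IDs; the dimensions and tower structure are illustrative, not the TFRS reference implementation:

```python
import tensorflow as tf

NUM_USERS, NUM_ITEMS, DIM = 10_000, 50_000, 64      # illustrative sizes

def make_tower(vocab_size, dim):
    # Embed an integer ID, then refine it with a small MLP.
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, dim),
        tf.keras.layers.Dense(dim, activation="relu"),
        tf.keras.layers.Dense(dim),
    ])

user_tower = make_tower(NUM_USERS, DIM)             # query tower (runs online)
item_tower = make_tower(NUM_ITEMS, DIM)             # item tower (precomputed offline)

def in_batch_softmax_loss(user_ids, item_ids):
    # Score every user against every item in the batch. The diagonal holds
    # the true (positive) pairs; every other column is an in-batch negative.
    u = tf.math.l2_normalize(user_tower(user_ids), axis=-1)
    v = tf.math.l2_normalize(item_tower(item_ids), axis=-1)
    logits = tf.matmul(u, v, transpose_b=True)      # [B, B] similarity matrix
    labels = tf.range(tf.shape(logits)[0])          # diagonal indices
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=labels, logits=logits))

# Example: loss over a batch of three (user, clicked item) pairs.
loss = in_batch_softmax_loss(tf.constant([1, 2, 3]), tf.constant([10, 20, 30]))
```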
One of the most influential candidate generation systems was described in the 2016 paper "Deep Neural Networks for YouTube Recommendations" by Paul Covington, Jay Adams, and Emre Sargin at Google. The paper introduced a deep neural network approach to candidate generation that treated the problem as an extreme multi-class classification task.
The model takes as input a user's watch history (represented as averaged video ID embeddings), search history, demographic features, and geographic features. It feeds these through several fully connected layers and outputs a probability distribution over the entire video corpus using a sampled softmax (since the full softmax over millions of videos would be computationally infeasible).
At serving time, the model produces a user embedding vector, which is used to retrieve the nearest video embeddings via ANN search. This paper demonstrated that deep learning could substantially outperform traditional methods for candidate generation at scale and has been cited over 5,000 times.
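Assuming a TensorFlow implementation, a sampled softmax of the kind the paper describes is typically wired up along these lines; the corpus size, dimensions, and variable names here are illustrative:

```python
import tensorflow as tf

num_videos, dim, num_sampled = 1_000_000, 256, 100  # illustrative sizes

# The output layer treats each video as one class of an extreme classifier.
class_weights = tf.Variable(tf.random.normal([num_videos, dim]) * 0.01)
class_biases = tf.Variable(tf.zeros([num_videos]))

def candidate_loss(user_vectors, watched_video_ids):
    # user_vectors: [batch, dim] output of the user network.
    # watched_video_ids: [batch, 1] int64 ground-truth next-watch labels.
    # Sampled softmax evaluates only num_sampled negative classes per step
    # instead of all num_videos, which is what makes training tractable.
    return tf.reduce_mean(tf.nn.sampled_softmax_loss(
        weights=class_weights,
        biases=class_biases,
        labels=watched_video_ids,
        inputs=user_vectors,
        num_sampled=num_sampled,
        num_classes=num_videos))
```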
Pinterest developed PinSage, a graph neural network (GNN) for candidate generation at web scale. PinSage uses random-walk-based graph convolutions to learn embeddings for pins (images) on a graph with over 3 billion nodes and 18 billion edges. The learned embeddings capture both content features and graph structure (which pins appear on the same boards).
PinSage generates candidates by performing nearest neighbor lookup in the learned embedding space. In experiments, PinSage outperformed the best baseline by 40% in recall and 22% in Mean Reciprocal Rank (MRR). The model uses importance-based neighborhood sampling and MapReduce-based inference to generate embeddings for billions of nodes within a few hours.
Once candidate generation models produce embeddings for users and items, the system needs to quickly find the items whose embeddings are closest to the query embedding. Exact nearest neighbor search takes time linear in the catalog size for every query (O(n)), which is too slow for large catalogs. Approximate nearest neighbor (ANN) algorithms trade a small amount of accuracy for large speedups, typically achieving sub-millisecond query times over billion-scale indexes.
ANN algorithms fall into four main categories:
| Category | Method | How it works | Trade-offs |
|---|---|---|---|
| Tree-based | KD-trees, Annoy | Recursively partition the embedding space into regions using hyperplane splits | Fast for low dimensions, degrades in high dimensions |
| Hash-based | Locality-Sensitive Hashing (LSH) | Hashes similar vectors into the same buckets with high probability, then searches only within matching buckets | Memory-efficient, but lower recall than graph methods |
| Graph-based | HNSW | Builds a multi-layer navigable small world graph where each node connects to nearby nodes; search navigates the graph from coarse to fine layers | State-of-the-art recall/speed trade-off, higher memory usage |
| Quantization-based | Product Quantization (PQ), ScaNN | Compresses vectors into compact codes to enable fast approximate distance computation | Memory-efficient, good for very large indexes |
FAISS (Facebook AI Similarity Search): An open-source library developed by Meta for efficient similarity search and clustering of dense vectors. FAISS supports multiple indexing strategies including IVF (Inverted File Index), HNSW, and Product Quantization. It provides GPU acceleration for faster indexing and querying and is widely used in production recommendation systems.
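A minimal FAISS sketch of the IVF pattern, assuming precomputed item embeddings; the data and parameter values are synthetic:

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim, num_items, nlist = 64, 100_000, 256
item_vecs = np.random.rand(num_items, dim).astype("float32")

# IVF index: k-means-partition the space into nlist cells; at query time,
# search only a few of the closest cells rather than the whole index.
quantizer = faiss.IndexFlatIP(dim)                  # inner-product metric
index = faiss.IndexIVFFlat(quantizer, dim, nlist,
                           faiss.METRIC_INNER_PRODUCT)
index.train(item_vecs)                              # learn the partition
index.add(item_vecs)                                # add item embeddings
index.nprobe = 8                                    # cells probed per query

query = np.random.rand(1, dim).astype("float32")    # online user embedding
scores, item_ids = index.search(query, k=100)       # top-100 candidates
```

Raising nprobe improves recall at the cost of latency, which is exactly the recall-latency trade-off discussed below.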
ScaNN (Scalable Nearest Neighbors): Developed by Google Research, ScaNN uses anisotropic vector quantization optimized for maximum inner product search (MIPS). It combines search space pruning via k-means partitioning with learned quantization. ScaNN powers Google's Vertex AI Vector Search and is available as open-source software.
HNSW (Hierarchical Navigable Small Worlds): An algorithm (not a library) that builds a multi-layer proximity graph. Each layer contains a subset of the data points, with higher layers being sparser. Search starts at the top layer and navigates downward, using greedy routing at each level. HNSW consistently achieves some of the best recall-speed trade-offs in benchmarks and is implemented in FAISS, Hnswlib, and many vector databases.
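A minimal sketch using the Hnswlib Python bindings, with illustrative parameter values:

```python
import numpy as np
import hnswlib  # pip install hnswlib

dim, num_items = 64, 100_000
item_vecs = np.random.rand(num_items, dim).astype("float32")

# Build the multi-layer graph; M controls per-node out-degree, and
# ef_construction controls build-time search breadth (higher values
# yield a better graph but a slower build).
index = hnswlib.Index(space="ip", dim=dim)          # inner-product space
index.init_index(max_elements=num_items, M=16, ef_construction=200)
index.add_items(item_vecs, np.arange(num_items))

index.set_ef(64)                                    # query-time search breadth
labels, distances = index.knn_query(item_vecs[:1], k=10)
```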
Annoy (Approximate Nearest Neighbors Oh Yeah): Developed by Spotify, Annoy uses random projection trees to partition the space. It is designed to be memory-efficient and supports memory-mapped files, making it suitable for use cases where the index must be shared across processes.
Candidate generation also plays a central role in information retrieval (IR), particularly in search engines and question answering (QA) systems. In IR, the task is often called "first-stage retrieval" or "document retrieval."
Traditional IR systems use sparse retrieval methods based on term-matching statistics. The most widely used algorithm is BM25 (Best Matching 25), which extends TF-IDF with document length normalization and term frequency saturation. BM25 scores documents based on how well they match the query terms, using an inverted index for fast lookup.
BM25 remains a strong baseline for candidate generation in search. Its strengths include fast retrieval via inverted indexes, good performance on exact keyword matching, effective handling of rare terms, and interpretable scoring. Its main limitation is that it cannot capture semantic similarity between terms that do not share surface forms.
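As a small illustration, the rank_bm25 Python package provides a BM25Okapi implementation; the corpus and whitespace tokenization below are deliberately naive:

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

corpus = [
    "how to change the oil in a car",
    "automobile repair and maintenance guide",
    "best hiking trails in the alps",
]
tokenized = [doc.split() for doc in corpus]         # naive whitespace tokens

bm25 = BM25Okapi(tokenized)
query = "car maintenance".split()

scores = bm25.get_scores(query)                     # one score per document
top_docs = bm25.get_top_n(query, corpus, n=2)       # candidate documents
print(scores, top_docs)
```

Note that the query matches only documents sharing its surface terms; "automobile repair" scores on "maintenance" alone, which illustrates the semantic gap discussed next.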
Dense retrieval models such as Dense Passage Retrieval (DPR) and Contriever use transformer-based encoders to map queries and documents into dense embedding vectors. Candidate documents are retrieved by finding the nearest embeddings to the query embedding, typically using ANN search.
Dense retrieval captures semantic similarity, so a query about "car maintenance" can match a document about "automobile repair" even though the terms differ. However, dense retrieval can struggle with rare entities and exact term matching, where BM25 excels.
Many modern systems combine sparse and dense retrieval for candidate generation. A common pattern is to run BM25 and a dense retrieval model in parallel, merge their candidate sets, and pass the combined set to a re-ranker. This hybrid approach captures both lexical and semantic matching signals. Hybrid retrieval is widely used in retrieval-augmented generation (RAG) pipelines for large language models.
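One common way to merge the candidate sets is reciprocal rank fusion (RRF), which needs only each retriever's ranks rather than comparable scores. A minimal sketch with hypothetical document IDs:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge best-first candidate lists from multiple retrievers.

    An item's fused score is the sum of 1 / (k + rank) over every list
    it appears in; k=60 is the constant from the original RRF paper and
    damps the influence of the very top ranks.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7"]           # hypothetical sparse results
dense_hits = ["d1", "d9", "d3"]          # hypothetical dense results
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))  # fused candidates
```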
Both content-based and collaborative filtering candidate generation methods rely on embedding spaces, where users and items are represented as vectors in a shared low-dimensional space. The choice of similarity measure affects which candidates are retrieved.
| Similarity measure | Formula | Properties |
|---|---|---|
| Cosine similarity | cos(u, v) = (u · v) / (‖u‖ ‖v‖) | Measures angle between vectors; insensitive to vector magnitude; commonly used when embedding norms are not meaningful |
| Dot product | u · v | Sensitive to both angle and magnitude; popular items with larger embedding norms can dominate results |
| Euclidean distance | ‖u − v‖₂ | Measures geometric distance; useful when magnitude carries meaning |
Google's ML recommendation course notes that "the dot product similarity is sensitive to the norm of the embedding," which can cause frequently occurring or popular items to overshadow more semantically relevant but less popular items. Some systems address this by normalizing embeddings before indexing.
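A small numeric example of this effect, using synthetic two-dimensional vectors:

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

user = np.array([1.0, 0.0])
niche = np.array([0.9, 0.1])        # well-aligned item, small norm
popular = np.array([3.0, 2.0])      # less-aligned item, large norm

# Dot product rewards the large-norm (popular) item...
print(user @ niche, user @ popular)                 # 0.9 vs 3.0
# ...while cosine similarity prefers the better-aligned one.
print(cosine(user, niche), cosine(user, popular))   # ~0.994 vs ~0.832
```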
Candidate generation models consume several types of features:
Dense features are continuous numerical values such as ratings, timestamps, prices, or popularity scores. These are fed directly into neural network layers.
Sparse features are categorical values such as item IDs, genres, user IDs, or geographic locations. These are high-cardinality and must be converted to dense representations through embedding layers. A vocabulary of item IDs may contain millions of entries, each mapped to an embedding vector of 10 to 100 dimensions.
Multivalent features represent variable-length sets. For example, a user's watch history is a set of video IDs of varying length. The standard approach is to look up the embedding for each ID and average them into a single fixed-width vector, which collapses the variable-length set into a fixed-size input and smooths over noise in individual interactions.
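A minimal sketch of this lookup-and-average step, with a randomly initialized table standing in for learned embeddings:

```python
import numpy as np

vocab_size, dim = 1_000, 16
embedding_table = np.random.randn(vocab_size, dim)  # learned in practice

def encode_watch_history(video_ids):
    """Collapse a variable-length ID list into one fixed-width vector."""
    if not video_ids:                   # empty history: fall back to zeros
        return np.zeros(dim)
    return embedding_table[video_ids].mean(axis=0)

short_history = encode_watch_history([42, 7])
long_history = encode_watch_history([42, 7, 301, 888, 15])
assert short_history.shape == long_history.shape == (dim,)
```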
The cold start problem arises when there is insufficient data to generate good candidates for new users or new items.
New user cold start: A user who has just joined the platform has no interaction history, so collaborative filtering and personalized embedding models have no signal to work with. Common solutions include:
- Falling back to popular or trending items.
- Using demographic, device, or geographic context as a coarse proxy for preferences.
- Asking new users to select interests explicitly during onboarding.
- Exploration strategies (for example, bandit-style algorithms) that gather preference signal quickly.
New item cold start: A newly added item has no interaction data, so collaborative filtering cannot surface it. Solutions include:
- Content-based retrieval using the item metadata (category, text, images) available at creation time.
- Dedicated exploration slots or boosts that expose the item to a sample of users to bootstrap interaction data.
- Models such as two-tower architectures that build item embeddings from content features rather than interaction history.
Content-based filtering is naturally more robust to the item cold start problem because it relies on item features rather than interaction history.
Candidate generation systems are evaluated differently from ranking systems because the goals differ. The primary concern is whether the relevant items are present in the retrieved set, not their exact ordering.
| Metric | What it measures | Why it matters for candidate generation |
|---|---|---|
| Recall@K | Fraction of relevant items that appear in the top K retrieved candidates | Directly measures whether the system retrieves the items that matter; the primary offline metric |
| Hit rate@K | Whether at least one relevant item appears in the top K | Useful when only one item needs to be retrieved correctly |
| Coverage | Fraction of the catalog that appears in recommendations across all users | Measures whether the system avoids over-concentrating on popular items |
| Latency (p50, p99) | Time to retrieve candidates for a single query | Candidate generation must complete in single-digit milliseconds to meet serving SLAs |
| Throughput (QPS) | Queries per second the system can handle | Determines infrastructure cost and scalability |
There is a fundamental trade-off between recall and latency. Increasing the number of ANN probes or the number of retrieved candidates improves recall but increases latency. Production systems tune these parameters to find the best operating point for their specific requirements.
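A minimal Recall@K implementation, with hypothetical item IDs:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items present in the top-k retrieved list."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)

# Example: 2 of 3 relevant items appear in the top 5 retrieved candidates.
print(recall_at_k(["a", "b", "c", "d", "e"], ["a", "c", "z"], k=5))  # ~0.667
```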
YouTube's recommendation system uses multiple candidate generation sources running in parallel. Each source retrieves a few hundred candidates using different signals (watch history, search history, trending videos, subscriptions). The union of candidates from all sources is sent to a deep ranking network. The system serves over a billion users and must generate candidates in real time.
Pinterest uses PinSage, a graph neural network trained on a bipartite graph of pins and boards. PinSage generates embeddings for over 3 billion pins and retrieves candidates through nearest neighbor search in the embedding space. The system also uses visual similarity and text-based retrieval as additional candidate sources.
Instagram's Explore recommendation system uses a multi-stage funnel. The candidate generation stage retrieves candidates from multiple sources, including account-based and media-based sources. These candidates pass through a three-stage ranking funnel: a distilled model returns the top 150, a lightweight neural network trims to the top 50, and a full deep neural network selects the final 25.
Google's search engine uses a multi-stage retrieval pipeline where BM25 and dense retrieval models generate candidate documents from a web-scale index. These candidates are then re-ranked by more sophisticated models such as BERT-based re-rankers. The candidate generation stage must handle trillions of documents with response times under 200 milliseconds.
| Approach | Speed | Cold start handling | Serendipity | Feature requirements | Typical scale |
|---|---|---|---|---|---|
| Content-based filtering | Fast | Good (uses item features) | Low | High (needs item metadata) | Small to medium |
| User-based collaborative filtering | Moderate | Poor (needs user history) | Moderate | Low | Small to medium |
| Item-based collaborative filtering | Moderate | Poor (needs item history) | Moderate | Low | Medium |
| Matrix factorization | Fast (after training) | Poor | Moderate | Low | Medium to large |
| Two-tower deep model | Very fast (at serving) | Moderate (can use content features) | Moderate | Flexible | Very large (billions) |
| Graph neural networks | Fast (after training) | Moderate | High | Needs graph structure | Very large (billions) |
| BM25 (sparse retrieval) | Very fast | Good (uses text features) | Low | Text required | Very large (trillions) |
| Dense retrieval | Fast | Moderate | Moderate | Text or features required | Large |
Use multiple retrieval sources. Combine collaborative, content-based, and popularity-based sources to improve recall and diversity. Most production systems run several candidate generators in parallel and merge their outputs.
Optimize for recall, not precision. At the candidate generation stage, missing a relevant item is worse than including an irrelevant one. The ranking stage will filter out irrelevant candidates.
Monitor catalog coverage. If the candidate generation stage consistently retrieves only popular items, the system will develop a popularity bias. Track the fraction of the catalog that appears in recommendations.
Tune ANN parameters carefully. The number of probes, the number of clusters, and the number of retrieved neighbors all affect the recall-latency trade-off. Use offline benchmarks to find the right operating point.
Handle cold start explicitly. Have fallback strategies for new users and new items. Blending content-based and popularity-based signals helps when interaction data is sparse.
Keep serving latency low. Candidate generation should complete in single-digit milliseconds for real-time applications. Precompute item embeddings, use efficient ANN indexes, and batch requests where possible.
Update embeddings regularly. User preferences and item relevance change over time. Re-train models and rebuild ANN indexes on a regular schedule (daily or more frequently for rapidly changing catalogs).