See also: Machine learning terms
A recommendation system (also called a recommender system) is an information filtering system that predicts a user's preferences and suggests relevant items. In machine learning terms, it is a type of algorithm that provides personalized suggestions or recommendations to users, typically in the context of digital platforms such as e-commerce websites, streaming services, social media platforms, job boards, and news aggregators. Recommendation systems draw on techniques from machine learning, data mining, and information retrieval to identify and rank items most likely to interest individual users based on their behavior, preferences, and context.
The commercial importance of recommendation engines is substantial. Netflix has estimated that its recommendation system saves the company over $1 billion per year through improved customer retention, while YouTube has reported that more than 70% of watch time originates from algorithmic recommendations. Amazon attributes roughly 35% of its revenue to its recommendation engine.
The study of recommendation systems gained widespread attention following the Netflix Prize competition (2006 to 2009), an open contest that offered a $1 million prize to any team that could improve Netflix's existing Cinematch recommendation algorithm by at least 10% in root mean squared error (RMSE). Netflix provided a training dataset of over 100 million ratings from roughly 480,000 users across 17,770 movies. The competition was won by the BellKor's Pragmatic Chaos team, which achieved a 10.06% improvement by combining over 100 models using ensemble methods, blending matrix factorization with neighborhood-based approaches and time-based adjustments. Although Netflix never fully deployed the winning solution due to engineering complexity, the competition popularized benchmarking datasets, open competitions, and collaborative filtering research.
Imagine you walk into a huge library with millions of books, and you have no idea what to read next. A recommendation system is like a really smart librarian who remembers every book you have ever read, knows which ones you liked, and also knows what thousands of other people with similar tastes enjoyed. Based on all of that information, the librarian picks out a short stack of books you will probably love. That is essentially what Netflix, Spotify, and Amazon do when they suggest movies, songs, or products to you. The system uses mathematics and records of what you and many other people have enjoyed to figure out what you might like best, making your life easier and more fun.
The history of recommendation systems spans more than three decades, evolving from simple rule-based filters to sophisticated deep learning architectures.
| Year | Milestone | Significance |
|---|---|---|
| 1992 | Tapestry system (Xerox PARC) | First system to use the term "collaborative filtering"; relied on manual annotations from a small community of users to filter email and Usenet posts |
| 1994 | GroupLens | Automated collaborative filtering for Usenet news articles using user ratings; one of the first academic recommender systems |
| 1998 | Amazon.com launches recommendations | Early large-scale deployment of item-to-item collaborative filtering in e-commerce |
| 2003 | Linden, Smith, and York publish Amazon's item-to-item paper | Described the scalable item-to-item collaborative filtering algorithm; became one of the most cited papers in the field |
| 2006 | Netflix Prize announced | Offered $1 million for a 10% improvement over Netflix's Cinematch algorithm; attracted over 40,000 teams and catalyzed research in matrix factorization and ensemble methods |
| 2009 | BellKor's Pragmatic Chaos wins Netflix Prize | Winning solution combined more than 100 models, blending matrix factorization with neighborhood methods and temporal dynamics |
| 2016 | YouTube deep neural network paper (Covington et al.) | Introduced the influential two-stage candidate generation and ranking architecture using deep neural networks at massive scale |
| 2017 | Neural Collaborative Filtering (He et al.) | Replaced the traditional inner product with neural network architectures capable of learning nonlinear user-item interactions |
| 2018 | SASRec (Kang and McAuley) | Applied self-attention mechanisms to sequential recommendation, outperforming RNN-based approaches |
| 2019 | BERT4Rec (Sun et al.) | Introduced bidirectional self-attention with masked item prediction for sequential recommendation |
| 2020 | LightGCN (He et al.) | Simplified graph convolution for recommendation by removing unnecessary operations, achieving strong performance with minimal complexity |
| 2023+ | LLM-based recommendation systems | Large language models integrated into recommendation pipelines as feature encoders, rankers, and conversational agents |
There are three primary types of recommendation systems: content-based filtering, collaborative filtering, and hybrid approaches. A fourth family, knowledge-based recommendation, is used in domains where interaction data is scarce.
Content-based filtering recommends items similar to those a user has previously engaged with. The system analyzes item attributes (genre, director, keywords, price range, actors) and builds a user profile reflecting their preferences. For textual content, techniques such as TF-IDF (Term Frequency-Inverse Document Frequency) weighting and word embeddings are commonly used to represent items as vectors, and similarity between items is then computed using measures such as cosine similarity, Euclidean distance, or Jaccard similarity.
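A minimal sketch of this pipeline, assuming scikit-learn is available; the catalog and descriptions below are hypothetical stand-ins for real item metadata:

```python
# Content-based filtering sketch: represent items as TF-IDF vectors,
# build the user profile as the mean of liked-item vectors, and rank
# unseen items by cosine similarity to that profile.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical catalog; in practice the text would be genres, keywords, cast.
items = {
    "Movie A": "space opera adventure epic science fiction",
    "Movie B": "romantic comedy wedding friendship",
    "Movie C": "science fiction dystopia artificial intelligence",
    "Movie D": "courtroom drama legal thriller",
}
liked = ["Movie A"]  # items the user previously engaged with

titles = list(items)
matrix = TfidfVectorizer().fit_transform(items.values())  # items x terms
profile = matrix[[titles.index(t) for t in liked]].mean(axis=0)
scores = cosine_similarity(np.asarray(profile), matrix).ravel()

for title, score in sorted(zip(titles, scores), key=lambda p: -p[1]):
    if title not in liked:
        print(f"{title}: {score:.3f}")  # Movie C ranks highest (shared terms)
```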
Advantages: Does not require data from other users; can explain recommendations transparently ("because you liked X"); effective for new items with rich metadata.
Disadvantages: Limited diversity since recommendations cluster around past preferences (a phenomenon known as the "filter bubble" or "overspecialization" problem); depends heavily on the quality of feature engineering; struggles to capture nuanced taste patterns that emerge from user behavior.
The Music Genome Project, which powers Pandora's radio service, is a well-known example. Each song is annotated with up to 450 attributes by trained musicologists, and recommendations are generated based on attribute similarity.
Collaborative filtering assumes that users who agreed in the past will agree in the future. Rather than analyzing item features, it relies on patterns in user-item interactions (ratings, clicks, purchases).
This approach identifies users with similar interaction histories and recommends items that those similar users preferred. Similarity is typically measured with Pearson correlation, cosine similarity, or other distance metrics over rating vectors. The predicted rating for an item is often computed as a weighted average of the ratings given by similar users, where the weights correspond to user similarity scores. GroupLens (1994) was among the first systems to implement this approach.
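A toy illustration of the weighted-average prediction described above, using cosine similarity over co-rated items on a hypothetical rating matrix (not a production-ready implementation):

```python
# User-based CF sketch on a tiny rating matrix (0 = unrated): predict a
# rating as the similarity-weighted average of other users' ratings,
# with cosine similarity computed over co-rated items only.
import numpy as np

R = np.array([          # rows = users, columns = items; hypothetical data
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def similarity(u, v):
    both = (u > 0) & (v > 0)                    # co-rated items
    if not both.any():
        return 0.0
    return u[both] @ v[both] / (np.linalg.norm(u[both]) * np.linalg.norm(v[both]))

def predict(user, item):
    sims = np.array([similarity(R[user], R[v]) if v != user else 0.0
                     for v in range(len(R))])
    weights = sims * (R[:, item] > 0)           # only users who rated the item
    return weights @ R[:, item] / weights.sum() if weights.sum() else 0.0

print(f"{predict(user=0, item=2):.2f}")         # predicted rating for user 0
```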
Limitation: User similarity is unstable and expensive to compute at scale because the number of users grows continuously.
This method computes similarity between items based on how users collectively interact with them. If many users who liked item A also liked item B, those items are considered similar. When a user likes item A, the system recommends item B.
Item-based approaches are more practical in production because item similarity tends to be more stable than user similarity, enabling precomputation and caching. Amazon's item-to-item collaborative filtering, published by Linden, Smith, and York in 2003, pioneered this technique at scale and became the foundation of Amazon's "Customers who bought this also bought" feature.
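The core precomputation can be sketched in a few lines; the interaction matrix below is hypothetical, and a real system would use sparse matrices and approximate methods:

```python
# Item-to-item CF sketch: columns of the interaction matrix are item
# vectors; cosine similarity between columns yields precomputable
# "users who liked A also liked B" scores.
import numpy as np

R = np.array([[1, 1, 0, 0],     # hypothetical implicit feedback;
              [1, 1, 1, 0],     # rows = users, columns = items
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)

norms = np.linalg.norm(R, axis=0)
item_sim = (R.T @ R) / np.outer(norms, norms)   # item x item cosine matrix
np.fill_diagonal(item_sim, 0.0)                 # ignore self-similarity

print("Most similar to item 0:", np.argsort(-item_sim[0]))  # item 1 first
```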
Hybrid systems combine content-based and collaborative filtering to overcome the limitations of each individual method. Most contemporary production systems use hybrid architectures. Netflix, for instance, uses a combination of collaborative filtering, content-based analysis, and deep learning models to generate its recommendations. Common combination strategies include:
| Strategy | Description | Example |
|---|---|---|
| Weighted hybrid | Combines scores from multiple recommenders using learned or fixed weights | Netflix blending collaborative and content scores |
| Switching hybrid | Selects between approaches based on context (e.g., falls back to content-based when collaborative data is sparse) | New user receives content-based recs until enough interaction data accumulates |
| Feature augmentation | Uses the output of one recommender as input features for another | Collaborative filtering embeddings fed into a content-based model |
| Cascade hybrid | A first-stage recommender generates candidates; a second-stage recommender re-ranks them | Two-tower retrieval followed by a deep ranking model |
| Mixed hybrid | Presents recommendations from different approaches side by side | Showing "popular in your area" alongside "because you watched" |
Knowledge-based systems use explicit domain knowledge and user requirements rather than historical interaction data. They are particularly valuable in domains where purchases are infrequent (cars, real estate, financial products) and collaborative filtering cannot accumulate enough data.
Two main variants exist. Constraint-based systems filter items that satisfy a set of user-specified constraints (e.g., "budget under $30,000, fuel efficiency above 30 mpg"). Case-based systems retrieve items similar to a reference case and allow iterative critique-based refinement (e.g., "like this laptop but with more RAM").
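A constraint-based filter reduces to straightforward predicate filtering, as in this sketch with hypothetical car listings:

```python
# Constraint-based filtering sketch: items that satisfy explicit user
# requirements are returned directly; the listings are hypothetical.
cars = [
    {"model": "Hatch S", "price": 24_000, "mpg": 38},
    {"model": "SUV X",   "price": 42_000, "mpg": 24},
    {"model": "Sedan L", "price": 29_500, "mpg": 33},
]

def satisfies(car):
    return car["price"] < 30_000 and car["mpg"] > 30

print([car["model"] for car in cars if satisfies(car)])  # Hatch S, Sedan L
```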
Knowledge-based approaches do not suffer from cold start problems but require significant upfront effort to encode domain rules and ontologies.
Matrix factorization is a class of collaborative filtering algorithms that decomposes the user-item interaction matrix into the product of two lower-dimensional matrices, representing users and items as vectors in a shared latent factor space. The predicted preference of a user for an item is approximated by the dot product of their respective vectors.
Formally, given a user-item rating matrix R of dimensions m x n (where m is the number of users and n is the number of items), matrix factorization finds two matrices P (of dimensions m x k) and Q (of dimensions n x k), where k is the number of latent factors, such that:
R ≈ P × Q^T
The latent factors may capture properties like genre preferences or quality perception, though they are not always directly interpretable. Matrix factorization gained prominence during the Netflix Prize, where it was a core component of the winning solution.
Singular Value Decomposition (SVD) is a foundational matrix factorization technique in linear algebra that decomposes a matrix into three components: U, Σ, and V^T, where U and V are orthogonal matrices and Σ is a diagonal matrix containing singular values. In recommender systems, a truncated SVD retains only the top k singular values to produce a low-rank approximation that captures the most important latent factors.
In practice, the "SVD" commonly used in recommender systems is often called "Funk SVD" (named after Simon Funk, a Netflix Prize competitor). It is not a true singular value decomposition; instead, it learns factor matrices directly through stochastic gradient descent by minimizing reconstruction error on observed ratings, with added regularization to prevent overfitting. This approach handles missing values naturally and scales to large datasets.
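A minimal Funk-SVD-style training loop might look like the following; the ratings, learning rate, and regularization strength are illustrative only:

```python
# Funk-SVD-style matrix factorization sketch: learn user and item factor
# matrices by stochastic gradient descent on observed ratings only, with
# L2 regularization to discourage overfitting.
import numpy as np

rng = np.random.default_rng(0)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0), (2, 2, 5.0)]
n_users, n_items, k = 3, 3, 2
lr, reg, epochs = 0.01, 0.02, 200

P = rng.normal(scale=0.1, size=(n_users, k))   # user factors
Q = rng.normal(scale=0.1, size=(n_items, k))   # item factors

for _ in range(epochs):
    for u, i, r in ratings:
        err = r - P[u] @ Q[i]                   # prediction error
        P[u] += lr * (err * Q[i] - reg * P[u])  # gradient steps
        Q[i] += lr * (err * P[u] - reg * Q[i])

print(P @ Q.T)  # reconstructed rating matrix, R ≈ P × Q^T
```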
Alternating Least Squares (ALS) solves the factorization problem by alternating between fixing the user matrix and solving for the item matrix, and vice versa. Each step reduces to a least squares problem with a closed-form solution. ALS is well suited to large-scale, sparse datasets and is popular in distributed computing environments because each user update and each item update can be computed independently, making the algorithm highly parallelizable. Apache Spark's MLlib library provides a widely used ALS implementation. ALS also handles implicit feedback naturally through the weighted matrix factorization formulation introduced by Hu, Koren, and Volinsky in 2008.
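The alternation can be sketched as a pair of ridge-regression solves, here on a small dense toy matrix rather than the sparse distributed setting ALS is designed for:

```python
# ALS sketch: alternately solve a regularized least squares problem for
# each user row and each item row; each row update is independent, which
# is what makes ALS easy to parallelize.
import numpy as np

R = np.array([[5, 3, 0], [4, 0, 0], [1, 1, 5]], dtype=float)
mask = R > 0                        # observed entries only
k, reg, iters = 2, 0.1, 20
rng = np.random.default_rng(0)
P = rng.normal(size=(R.shape[0], k))
Q = rng.normal(size=(R.shape[1], k))

for _ in range(iters):
    for u in range(R.shape[0]):     # fix Q, solve for each user row
        Qm = Q[mask[u]]
        P[u] = np.linalg.solve(Qm.T @ Qm + reg * np.eye(k), Qm.T @ R[u, mask[u]])
    for i in range(R.shape[1]):     # fix P, solve for each item row
        Pm = P[mask[:, i]]
        Q[i] = np.linalg.solve(Pm.T @ Pm + reg * np.eye(k), Pm.T @ R[mask[:, i], i])

print(np.round(P @ Q.T, 2))         # low-rank reconstruction of R
```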
Non-Negative Matrix Factorization (NMF) constrains all elements of the factor matrices to be non-negative, which can lead to more interpretable latent factors. Because the factors are non-negative, they can be thought of as additive parts, where each user's preference is a sum of positive contributions from different latent components.
Bayesian Personalized Ranking (BPR) is a learning framework for matrix factorization specifically designed for implicit feedback data (clicks, views, purchases) rather than explicit ratings. BPR optimizes for the correct ranking of items rather than rating prediction accuracy, using a pairwise loss function that encourages observed (interacted) items to be ranked higher than unobserved items.
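A sketch of the BPR update rule on hypothetical implicit-feedback data; the sampling scheme and hyperparameters are illustrative:

```python
# BPR sketch: for each (user, positive, negative) triple, nudge the
# factors so the observed item scores higher than a sampled unobserved
# item, using the gradient of the log-sigmoid pairwise objective.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 6, 2
interactions = {0: {1, 2}, 1: {2, 3}, 2: {0, 4}, 3: {4, 5}}  # implicit feedback

P = rng.normal(scale=0.1, size=(n_users, k))
Q = rng.normal(scale=0.1, size=(n_items, k))
lr, reg = 0.05, 0.01

for _ in range(2000):
    u = rng.integers(n_users)
    i = rng.choice(list(interactions[u]))          # observed (positive) item
    j = rng.integers(n_items)
    while j in interactions[u]:                    # sampled negative item
        j = rng.integers(n_items)
    x_uij = P[u] @ (Q[i] - Q[j])                   # pairwise score difference
    g = 1.0 / (1.0 + np.exp(x_uij))                # gradient of ln sigmoid(x)
    P[u] += lr * (g * (Q[i] - Q[j]) - reg * P[u])
    Q[i] += lr * (g * P[u] - reg * Q[i])
    Q[j] += lr * (-g * P[u] - reg * Q[j])

print("user 0 item ranking:", np.argsort(-(P[0] @ Q.T)))
```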
Deep learning has significantly advanced recommendation capabilities since the mid-2010s, driven by the success of neural networks in areas such as computer vision and natural language processing. Deep learning models can learn complex, nonlinear relationships between users and items, automatically extract features from raw data, and incorporate heterogeneous data sources such as text, images, audio, and interaction sequences.
Proposed by Xiangnan He and colleagues in 2017 (published at WWW '17), NCF is a general framework that replaces the traditional inner product used in matrix factorization with neural network architectures capable of learning arbitrary interaction functions from data. NCF demonstrates that matrix factorization can be expressed as a special case within its framework. The framework includes three model instantiations:
| Model | Description | Key Idea |
|---|---|---|
| Generalized Matrix Factorization (GMF) | A generalization of standard MF where the element-wise product of user and item embeddings is passed through an output layer | Captures linear interactions between latent factors |
| Multi-Layer Perceptron (MLP) | Concatenates user and item embeddings and passes them through multiple fully connected layers | Learns nonlinear interactions through hidden layers |
| Neural Matrix Factorization (NeuMF) | Combines GMF and MLP by concatenating their respective last hidden layers before the final prediction layer | Jointly models both linear and nonlinear feature interactions |
NCF is optimized using binary cross-entropy loss with negative sampling, targeting implicit feedback (clicks, views) rather than explicit ratings. Experiments on the MovieLens and Pinterest datasets showed substantial improvements over traditional matrix factorization baselines.
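A compact NeuMF sketch, assuming PyTorch; the layer sizes are simplified, but the structure (a GMF branch and an MLP branch fused before a sigmoid output) follows the paper:

```python
# NeuMF sketch: the GMF branch takes the element-wise product of user
# and item embeddings; the MLP branch passes concatenated embeddings
# through hidden layers; the two are fused for the final prediction.
import torch
import torch.nn as nn

class NeuMF(nn.Module):
    def __init__(self, n_users, n_items, k=16):
        super().__init__()
        self.gmf_user = nn.Embedding(n_users, k)
        self.gmf_item = nn.Embedding(n_items, k)
        self.mlp_user = nn.Embedding(n_users, k)
        self.mlp_item = nn.Embedding(n_items, k)
        self.mlp = nn.Sequential(nn.Linear(2 * k, k), nn.ReLU(),
                                 nn.Linear(k, k // 2), nn.ReLU())
        self.out = nn.Linear(k + k // 2, 1)   # fuses GMF and MLP branches

    def forward(self, users, items):
        gmf = self.gmf_user(users) * self.gmf_item(items)
        mlp = self.mlp(torch.cat([self.mlp_user(users),
                                  self.mlp_item(items)], dim=-1))
        return torch.sigmoid(self.out(torch.cat([gmf, mlp], dim=-1))).squeeze(-1)

model = NeuMF(n_users=100, n_items=500)
print(model(torch.tensor([0, 1, 2]), torch.tensor([10, 20, 30])))
# Trained with binary cross-entropy against sampled negative items.
```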
Introduced by Heng-Tze Cheng and colleagues at Google in 2016, this architecture jointly trains a wide linear model (for memorization of specific feature combinations) alongside a deep neural network (for generalization through dense embeddings).
The two components are trained jointly using mini-batch stochastic gradient descent, with the wide component trained via FTRL (Follow-the-Regularized-Leader) and the deep component trained via AdaGrad. The model was deployed in production at the Google Play app store, where it increased app acquisitions by 3.9% compared to a wide-only model in live A/B tests. The implementation was open-sourced in TensorFlow.
Meta's DLRM (published 2019) is a neural network architecture designed specifically for personalization and recommendation. DLRM processes two types of input features:
- Dense (continuous) features, such as user age or counters, which are processed by a bottom multilayer perceptron (MLP)
- Sparse (categorical) features, such as user and item IDs, which are mapped to dense vectors through embedding tables
A key innovation is computing pairwise dot products between all embedding pairs and the processed dense feature vector, mimicking factorization machines within a deep learning framework. These dot product interactions are concatenated with the processed dense features and passed through a top MLP followed by a sigmoid activation to output a probability (for example, click-through probability).
DLRM uses model parallelism for the embedding tables (which can consume hundreds of gigabytes of memory) and data parallelism for the fully connected layers, enabling training at massive scale on industrial recommendation workloads. Meta open-sourced DLRM implementations in both PyTorch and Caffe2.
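The pairwise dot-product interaction can be sketched as follows, assuming PyTorch; the two sparse features and all dimensions are hypothetical:

```python
# DLRM-style interaction sketch: take pairwise dot products between
# sparse-feature embeddings and the bottom-MLP output, then concatenate
# them with the dense representation as input to the top MLP.
import torch

batch, k = 4, 8
emb_a = torch.randn(batch, k)      # embedding lookup for sparse feature A
emb_b = torch.randn(batch, k)      # embedding lookup for sparse feature B
dense = torch.randn(batch, k)      # bottom-MLP output for dense features

feats = torch.stack([emb_a, emb_b, dense], dim=1)   # (batch, 3, k)
dots = torch.bmm(feats, feats.transpose(1, 2))      # all pairwise dot products
rows, cols = torch.triu_indices(3, 3, offset=1)     # keep each pair once
interactions = dots[:, rows, cols]                  # (batch, 3)

top_mlp_input = torch.cat([dense, interactions], dim=1)  # (batch, k + 3)
print(top_mlp_input.shape)  # fed to the top MLP, then a sigmoid output
```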
DeepFM, proposed by Guo et al. in 2017, combines factorization machines (FM) with deep neural networks in a single end-to-end architecture, eliminating the need for manual feature engineering for the wide component. The FM component captures second-order feature interactions while sharing its embedding layer with the deep component, which models higher-order interactions. This sharing reduces model complexity and eliminates the need for separate feature engineering.
DCN, introduced by Wang et al. at Google in 2017, replaces the wide component of Wide and Deep with a cross network that explicitly models bounded-degree feature interactions at each layer, with linear computational cost per layer. Each cross-network layer applies a cross operation that produces higher-order interactions. DCN-V2 (published in 2021) improved the architecture by introducing mixture-of-experts cross layers with low-rank parameterization for greater expressiveness.
Two-tower (dual-encoder) models encode users and items into a shared embedding space using two independent neural networks (the "towers"). This architecture has become a cornerstone of large-scale industrial recommendation systems, particularly for the candidate generation stage.
| Component | Function |
|---|---|
| User tower | Encodes user features (demographics, history, context) into a dense user embedding |
| Item tower | Encodes item features (metadata, content, popularity, category) into a dense item embedding |
| Scoring function | Computes relevance as the dot product or cosine similarity between user and item embeddings |
The critical advantage of two-tower models is that item embeddings can be precomputed offline and indexed for fast approximate nearest neighbor (ANN) retrieval. At serving time, only the user embedding needs to be computed, and ANN algorithms (such as FAISS, ScaNN, or HNSW) retrieve the top-K most relevant items from the precomputed index in milliseconds even from catalogs containing billions of items.
Two-tower models are typically trained using contrastive loss or sampled softmax loss. YouTube's candidate generation system, described by Covington et al. in 2016, exemplifies this approach. It treats recommendation as an extreme multi-class classification problem, with the softmax output layer spanning millions of video classes.
Limitation: Because user and item features are encoded independently, two-tower models cannot capture fine-grained cross-feature interactions between users and items. They are therefore best suited for the retrieval stage rather than final ranking.
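A minimal two-tower sketch, assuming PyTorch, trained with an in-batch softmax (one common instance of the contrastive losses mentioned above); the feature dimensions are hypothetical:

```python
# Two-tower sketch: independent user and item encoders map features into
# a shared embedding space; relevance is a dot product, so item
# embeddings can be precomputed and indexed for ANN retrieval.
import torch
import torch.nn as nn

class Tower(nn.Module):
    def __init__(self, in_dim, out_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, out_dim))
    def forward(self, x):
        return nn.functional.normalize(self.net(x), dim=-1)  # unit norm

user_tower, item_tower = Tower(in_dim=20), Tower(in_dim=40)
users = torch.randn(8, 20)           # hypothetical user feature vectors
items = torch.randn(8, 40)           # matching positive item features

u, v = user_tower(users), item_tower(items)
logits = u @ v.T                     # in-batch negatives: diagonal = positives
loss = nn.functional.cross_entropy(logits, torch.arange(8))
print(loss.item())                   # contrastive (sampled-softmax-style) loss
```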
Sequential models capture temporal dynamics by modeling the ordered sequence of a user's interactions, recognizing that interaction order reveals evolving preferences. Unlike traditional collaborative filtering, which treats each user's interactions as an unordered set, sequential models use order as valuable information.
Early sequential approaches used Markov chains to model transition probabilities between items. First-order Markov chain models predict the next item based solely on the most recent interaction, while higher-order models consider multiple recent interactions. Factored Personalized Markov Chains (FPMC) combined first-order Markov chains with matrix factorization to capture both sequential patterns and general user preferences.
Recurrent neural networks (RNNs), particularly GRU-based models, were among the first deep learning approaches applied to sequential recommendation. GRU4Rec (Hidasi et al., 2016) applied gated recurrent units to session-based recommendation, processing sequences of item interactions to predict the next item. It introduced ranking-based loss functions (BPR loss and TOP1 loss) tailored to the recommendation setting and demonstrated strong performance on e-commerce click-stream data.
SASRec (Kang and McAuley, 2018, published at ICDM '18) applies a self-attention mechanism (inspired by the Transformer architecture) to sequential recommendation. Each item in a user's history attends to all previous items, with causal masking ensuring that predictions only use past interactions.
Key architectural components include:
- An item embedding layer combined with learnable positional embeddings
- Stacked self-attention blocks with causal (left-to-right) masking
- Point-wise feed-forward networks following each attention block
- Residual connections, layer normalization, and dropout for stable training
SASRec outperforms both Markov chain and RNN-based approaches on sparse and dense datasets while offering faster training and inference through parallelization.
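The causal masking at the heart of this approach can be illustrated with a single attention block, assuming PyTorch; this is a sketch, not the full SASRec model:

```python
# Causal self-attention sketch for next-item prediction: each position
# in the interaction sequence may attend only to earlier positions.
import torch
import torch.nn as nn

seq_len, k = 5, 32
items = torch.randint(0, 1000, (1, seq_len))      # one user's item history
emb = nn.Embedding(1000, k)
pos = nn.Embedding(seq_len, k)                    # learnable positions
attn = nn.MultiheadAttention(k, num_heads=4, batch_first=True)

x = emb(items) + pos(torch.arange(seq_len))
causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
out, _ = attn(x, x, x, attn_mask=causal)          # True entries are masked out
print(out.shape)                                  # (1, seq_len, k)
```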
BERT4Rec (Sun et al., 2019) adapts the bidirectional self-attention architecture of BERT to recommendation. Rather than predicting only the next item given all previous items (left-to-right), it uses a Cloze task (masked item prediction) that randomly masks items in the sequence and trains the model to reconstruct them, allowing the model to leverage context from both directions during training. This bidirectional approach captures richer sequential patterns, though it requires a different inference strategy than unidirectional models.
The integration of large language models (LLMs) into recommendation systems has become a major research direction since 2023. LLMs bring extensive world knowledge, zero-shot reasoning capabilities, natural language understanding, and the ability to process textual item descriptions and user reviews directly. Research surveys have categorized LLM-based approaches into several paradigms:
| Paradigm | Description | Strengths | Challenges |
|---|---|---|---|
| LLM as feature encoder | Uses LLM embeddings to represent items or user profiles from descriptions, reviews, and metadata | Rich semantic representations; captures nuance in text | High computational cost for encoding large catalogs |
| LLM as scoring/ranking model | Prompts the LLM with user history and candidate items to output preference scores or ranked lists | Zero-shot capability; can incorporate reasoning | Latency constraints; inconsistent output formatting |
| Conversational recommendation | LLM engages in multi-turn dialogue to elicit and refine preferences | Natural interaction; can explain recommendations | Hallucination risk; difficulty incorporating real-time signals |
| Retrieval-augmented generation (RAG) | Retrieves user history or item data and feeds it to the LLM for re-ranking or explanation | Grounds LLM outputs in actual data; reduces hallucination | Added pipeline complexity; retrieval quality bottleneck |
Despite their promise, LLMs face practical challenges for real-time recommendation: inference latency is orders of magnitude higher than traditional models, they may hallucinate item details, and they struggle to incorporate rapidly changing behavioral signals. As a result, LLMs are currently most practical for offline catalog enrichment, explanation generation, cold-start mitigation, and re-ranking in later pipeline stages.
Graph neural networks (GNNs) model the user-item interaction graph directly, propagating information through the graph structure to learn embeddings that capture high-order connectivity patterns.
PinSage (Ying et al., 2018), developed at Pinterest, applies graph convolutional networks to a bipartite graph with over 3 billion nodes and 18 billion edges. It uses random-walk-based neighborhood sampling to make computation tractable at web scale and demonstrated significant improvements in user engagement through online A/B tests.
LightGCN (He et al., 2020) simplified graph convolution for recommendation by removing self-connections, feature transformations, and nonlinear activations from the message-passing mechanism. It retains only the weighted sum of neighbor embeddings at each layer and combines representations from all layers through a simple summation. Despite its simplicity, LightGCN achieved state-of-the-art performance on multiple benchmarks.
NGCF (Neural Graph Collaborative Filtering) explicitly encodes the collaborative signal in the embedding propagation process, learning user and item embeddings by propagating them on the user-item bipartite graph.
Production systems at major technology companies typically employ multi-stage pipeline architectures to balance computational efficiency with recommendation quality.
Production recommendation systems generally follow a funnel-shaped pipeline with three or four stages:
| Stage | Purpose | Candidate pool | Typical latency | Common models |
|---|---|---|---|---|
| Candidate generation (retrieval) | Quickly narrow millions or billions of items to hundreds or thousands of candidates | Millions to billions in, 500 to 5,000 out | 5 to 30 ms | Two-tower models, ANN index (FAISS, ScaNN), inverted index |
| Pre-ranking (L1 ranking) | Further reduce candidates with lightweight models | 500 to 5,000 in, 100 to 500 out | 10 to 50 ms | Small neural networks, gradient-boosted trees |
| Ranking (L2 ranking) | Apply complex scoring models with rich features | 100 to 500 in, 10 to 50 out | 50 to 150 ms | DLRM, DCN-V2, Wide and Deep, cross-encoder models |
| Re-ranking | Apply business rules, diversity constraints, fairness adjustments, and freshness boosts | 10 to 50 in, final list out | 5 to 20 ms | Rule-based filters, contextual bandits |
The total end-to-end latency target for interactive recommendation surfaces is typically under 200 milliseconds at the 95th or 99th percentile.
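The funnel can be sketched as chained filtering stages; the scoring functions below are trivial stand-ins for the models listed in the table:

```python
# Multi-stage pipeline sketch: each stage scores fewer items with a
# (notionally) more expensive model, narrowing billions of candidates
# down to a final short list.
import random

def retrieve(user_id, catalog, k=1000):           # stand-in for ANN retrieval
    return random.sample(catalog, k)

def score_light(user_id, item_id):                # stand-in for the L1 model
    return hash((user_id, item_id)) % 1000

def score_heavy(user_id, item_id):                # stand-in for the L2 model
    return hash((item_id, user_id, "heavy")) % 1000

def recommend(user_id, catalog):
    cands = retrieve(user_id, catalog)                                  # ~1,000
    cands = sorted(cands, key=lambda i: -score_light(user_id, i))[:300] # L1
    ranked = sorted(cands, key=lambda i: -score_heavy(user_id, i))[:30] # L2
    return ranked[:10]                            # re-ranking / business rules

print(recommend(user_id=42, catalog=list(range(1_000_000))))
```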
YouTube's recommendation system, described in a landmark 2016 paper by Covington, Adams, and Sargin, uses deep neural networks in a two-stage architecture. The candidate generation network treats recommendation as extreme multi-class classification, producing a shortlist of hundreds of videos from a corpus of hundreds of millions. A separate deep ranking network then scores these candidates using hundreds of features, including watch history, search history, demographics, and video freshness. YouTube reports that over 70% of total watch time comes from algorithmic recommendations.
Netflix employs a microservices-based recommendation architecture combining collaborative filtering, content-based analysis, and deep learning models. The system personalizes not only which titles to recommend but also which artwork to display for each title, optimizing thumbnail selection through multi-armed bandit experiments. Netflix's system processes data through Apache Kafka for real-time event streaming and stores results in Apache Cassandra.
Amazon pioneered item-to-item collaborative filtering at scale, publishing its approach in 2003. The "Customers who bought this item also bought" feature became one of the most recognizable recommendation interfaces. Amazon's modern system integrates collaborative filtering with content-based methods and deep learning, personalizing results across product search, homepage recommendations, and email campaigns. Amazon has estimated that 35% of its revenue is driven by recommendations.
Spotify uses a hybrid multi-model system combining collaborative filtering trained on approximately 700 million user-generated playlists, NLP analysis of music reviews and blog posts, and raw audio analysis through convolutional neural networks that extract features from spectrograms (tempo, key, energy, danceability). These signals are combined to build holistic representations of each track, enriched with mood, genre, and style tags. Discover Weekly, Spotify's personalized playlist feature, draws on all three approaches.
TikTok's recommendation algorithm powers the For You Page (FYP), the platform's primary content discovery mechanism. Unlike follower-based feeds, TikTok prioritizes content relevance over creator popularity, meaning even accounts with zero followers can reach large audiences. The algorithm analyzes thousands of signals per user, with video watch time being one of the strongest indicators of interest; save and share actions carry more weight than likes. The system emphasizes real-time interest modeling and employs robust cold-start mechanisms for new users and new content, often cited as a factor in the platform's rapid global growth.
Meta uses DLRM-based ranking models with two-tower retrieval across Facebook and Instagram. Model-parallel embedding tables are distributed across GPUs, and the system processes trillions of recommendation requests daily through distributed GPU infrastructure. It powers feed ranking, ad placement, people-you-may-know suggestions, and group recommendations.
Pinterest deployed PinSage, a graph convolutional network operating on a bipartite graph of pins and boards with billions of nodes. The system generates embeddings for visual discovery, powering related pin recommendations, homefeed personalization, and shopping recommendations.
LinkedIn uses a hybrid deep learning system with a multi-objective optimization framework to balance engagement (clicks) and relevance across feed, job, and people recommendations.
Serving recommendations at web scale (millions of users, millions of items) requires specialized infrastructure.
When items are represented as embedding vectors, finding the most similar items to a user embedding is a nearest neighbor search problem. Exact search over millions of vectors is too slow for real-time serving, so production systems use approximate nearest neighbor algorithms that trade a small amount of accuracy for orders-of-magnitude speedups.
FAISS (Facebook AI Similarity Search) is a widely used open-source library supporting multiple indexing strategies, including inverted file indices (IVF), product quantization, and HNSW (Hierarchical Navigable Small World) graphs. With GPU acceleration, FAISS can process over 1.5 million queries per second on billion-scale datasets.
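A minimal retrieval sketch using FAISS's exact inner-product index (larger deployments would use IVF or HNSW indices); the embeddings are random placeholders:

```python
# ANN retrieval sketch, assuming the faiss package is installed: index
# precomputed item embeddings for inner-product search and fetch the
# top-10 candidates for a single user embedding.
import numpy as np
import faiss

dim, n_items = 64, 100_000
rng = np.random.default_rng(0)
item_embeddings = rng.standard_normal((n_items, dim)).astype("float32")

index = faiss.IndexFlatIP(dim)      # exact inner-product search; at larger
index.add(item_embeddings)          # scale, IVF/HNSW trade accuracy for speed

user_embedding = rng.standard_normal((1, dim)).astype("float32")
scores, item_ids = index.search(user_embedding, 10)
print(item_ids[0])                  # top-10 candidate item ids
```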
ScaNN (Scalable Nearest Neighbors), developed by Google, uses anisotropic vector quantization optimized for maximum inner product search, which is the scoring function used by two-tower models.
The standard industry approach separates retrieval (finding a manageable set of candidates from the full catalog) from ranking (scoring those candidates with expensive models). This separation enables each stage to use architectures optimized for its specific constraints: retrieval models must be fast and operate over the full item space, while ranking models can be more complex because they only score hundreds of candidates.
The cold start problem occurs when there is insufficient interaction data for accurate recommendations, affecting new users, new items, or entirely new systems. It is one of the most persistent challenges in recommendation systems and manifests in three forms.
| Type | Description | Example |
|---|---|---|
| New user cold start | No historical interactions to infer preferences | A first-time Netflix subscriber |
| New item cold start | No user interactions to establish item popularity or similarity | A newly released film with zero ratings |
| System cold start | Minimal overall data when a platform first launches | A startup deploying its first recommendation feature |
Evaluation requires a combination of offline metrics, online experiments, and qualitative assessment. No single metric captures all dimensions of recommendation quality, and offline performance does not always correlate with online user satisfaction.
| Metric | Type | Description |
|---|---|---|
| Precision@K | Set-based | Fraction of top-K recommendations that are relevant |
| Recall@K | Set-based | Fraction of all relevant items that appear in top-K |
| F1@K | Set-based | Harmonic mean of Precision@K and Recall@K |
| MAP (Mean Average Precision) | Rank-aware | Average of precision values computed at each position where a relevant item appears, averaged across users |
| MRR (Mean Reciprocal Rank) | Rank-aware | Average of the reciprocal rank of the first relevant item across users |
| NDCG@K | Rank-aware | Measures ranking quality with a logarithmic discount for position; accounts for graded relevance |
| Hit Rate@K | Set-based | Fraction of users who have at least one relevant item in their top-K |
| RMSE / MAE | Rating prediction | Root mean squared error or mean absolute error between predicted and actual ratings |
| AUC | Classification | Area under the ROC curve for binary relevance prediction |
Rank-aware metrics (MAP, MRR, NDCG) are generally preferred over set-based metrics because they penalize models that place relevant items lower in the ranking.
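These definitions translate directly into code; the following sketch computes three of the table's metrics for a single user with hypothetical binary relevance judgments:

```python
# Offline metric sketch: Precision@K, Recall@K, and NDCG@K for one user
# with binary relevance, following the definitions in the table above.
import math

def precision_at_k(ranked, relevant, k):
    return sum(1 for i in ranked[:k] if i in relevant) / k

def recall_at_k(ranked, relevant, k):
    return sum(1 for i in ranked[:k] if i in relevant) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    dcg = sum(1 / math.log2(pos + 2)                 # log discount by position
              for pos, i in enumerate(ranked[:k]) if i in relevant)
    ideal = sum(1 / math.log2(pos + 2)               # best achievable DCG
                for pos in range(min(k, len(relevant))))
    return dcg / ideal

ranked = ["a", "b", "c", "d", "e"]     # model's top-5 for one user
relevant = {"b", "e", "f"}             # ground-truth relevant items
print(precision_at_k(ranked, relevant, 5),   # 0.4
      recall_at_k(ranked, relevant, 5),      # 0.667
      ndcg_at_k(ranked, relevant, 5))        # ~0.478
```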
Online evaluation tests a recommendation model in a live environment using real users. A/B testing is the gold standard: a fraction of live traffic is routed to a new recommendation model while the rest continues using the baseline, and business metrics are compared.
Key online metrics include:
- Click-through rate (CTR) and conversion rate
- Dwell time, watch time, or session length
- User retention and return visits
- Revenue and other downstream business outcomes
A common finding is that improvements in offline metrics (such as NDCG) do not always translate to improvements in online metrics. This gap is why online evaluation remains essential before deploying new models. Interleaving is an alternative to A/B testing that requires smaller sample sizes: results from two models are merged into a single ranked list, and the model whose items receive more clicks is preferred.
Modern evaluation increasingly considers dimensions beyond prediction accuracy:
- Diversity: how varied the items within a recommendation list are
- Novelty: whether recommended items are new to the user
- Serendipity: recommendations that are surprising yet relevant
- Coverage: the fraction of the catalog the system actually recommends
- Fairness: equitable treatment of both users and item providers
Recommendation systems raise several ethical concerns that have attracted growing attention from researchers, regulators, and the public.
Recommendation systems tend to disproportionately recommend popular items, creating a feedback loop where popular items receive more exposure, generating more interactions, which makes them even more popular. This Matthew effect harms long-tail content creators and reduces catalog diversity. While this bias partly reflects genuine user preferences, algorithms can amplify it beyond what the underlying data warrants. Mitigation techniques include inverse propensity scoring, regularization terms that penalize overrepresentation of popular items, and calibrated recommendations that match the distribution of a user's actual interests.
Algorithmic personalization can create filter bubbles, where users are only exposed to content that reinforces their existing preferences and viewpoints. Echo chambers emerge when this dynamic is amplified by social network effects, potentially contributing to political polarization. Research on the severity of these phenomena presents mixed conclusions: some studies find that algorithmic curation reduces exposure to ideologically diverse content, while others find that short-term exposure to filter-bubble recommendation systems has limited polarization effects. The feedback loop between user behavior and algorithmic selection is the underlying mechanism, as users within algorithmic bubbles are encouraged to behave in ways the algorithm expects, which further reinforces the bubble.
Recommendation systems can perpetuate or amplify societal biases related to race, gender, age, and other protected attributes. For example, a job recommendation system trained on historical hiring data may systematically underrecommend positions to candidates from underrepresented groups. Fairness concerns span two dimensions: user-side fairness (equal recommendation quality across demographic groups) and provider-side fairness (equitable exposure for different content creators or sellers). Techniques for addressing fairness include constrained optimization, fairness-aware re-ranking, and adversarial debiasing.
Recommendation systems process extensive personal data, including browsing history, purchase records, location, and social connections. This data collection raises privacy concerns, particularly when users are not fully informed about how their data is used. Even collaborative filtering can indirectly reveal sensitive preferences through inference attacks. Privacy-preserving approaches include federated learning (training models on-device without centralizing data), differential privacy (adding calibrated noise to prevent individual identification), and on-device inference.
Shilling attacks inject fake user profiles or coordinated inauthentic behavior to manipulate recommendation outcomes, promoting or demoting specific items. Click farms and other coordinated inauthentic behaviors can also distort recommendation quality, particularly in contexts such as political content and product reviews. Defenses include anomaly detection, robust aggregation algorithms, and adversarial training.
There is growing concern that recommendation systems optimized for engagement metrics (such as time spent on a platform) can promote addictive usage patterns and surface harmful content. Platforms have faced criticism for recommending increasingly extreme content to maximize engagement, particularly on video and social media platforms. Some companies have introduced features such as screen time limits, content warnings, and "take a break" reminders in response to these concerns.
Regulatory frameworks are increasingly addressing recommendation systems. The EU's Digital Services Act (DSA), effective 2024, requires large online platforms to provide transparency about recommendation algorithms and offer users at least one option not based on profiling. The EU AI Act classifies certain recommendation systems as high-risk AI systems subject to additional requirements, particularly those influencing elections or vulnerable populations. In the United States, proposed legislation like the Filter Bubble Transparency Act would mandate disclosure of personalization and provide non-personalized alternatives.
While e-commerce and entertainment are the most visible applications, recommendation systems serve many other domains:
| Domain | Application | Example |
|---|---|---|
| News | Personalized article feeds and topic selection | Google News, Apple News |
| Music | Playlist generation, artist discovery, radio stations | Spotify Discover Weekly, Pandora |
| Job matching | Matching candidates to job postings based on skills, experience, and preferences | LinkedIn job recommendations |
| Social media | Friend suggestions, group recommendations, content feeds | Facebook People You May Know, Instagram Explore |
| Education | Personalized learning paths, course recommendations | Coursera, Khan Academy |
| Healthcare | Treatment option suggestions, drug interaction warnings | Clinical decision support systems |
| Travel | Hotel, flight, and activity recommendations | Booking.com, TripAdvisor |
| Finance | Personalized financial product recommendations, portfolio suggestions | Robo-advisors, banking apps |
| Academic research | Paper recommendations, citation suggestions | Google Scholar, Semantic Scholar |
| Advertising | Ad targeting and placement based on user profiles and context | Google Ads, Meta Ads |
| Approach | Core technique | Strengths | Limitations | Typical use cases |
|---|---|---|---|---|
| Content-based filtering | Item feature similarity | No user data needed; transparent reasoning; handles new items | Limited diversity; requires feature engineering | News article recommendation, Pandora's Music Genome Project |
| User-based collaborative filtering | User similarity on ratings/behavior | Captures complex preferences; no feature engineering | Poor scalability; cold start; sparse data | Early Amazon, GroupLens |
| Item-based collaborative filtering | Item similarity from co-interactions | More stable than user-based; precomputable | Cold start for items; sparse data | Amazon "also bought" recommendations |
| Matrix factorization (SVD, ALS) | Low-rank decomposition of interaction matrix | Handles sparsity well; scalable with ALS; strong baselines | Linear interactions only; limited expressiveness | Netflix Prize, Spotify |
| Neural Collaborative Filtering (NCF) | Neural networks for user-item interaction | Nonlinear interactions; flexible architecture | Requires more data and compute; potential overfitting | Research benchmarks (MovieLens, Pinterest) |
| Wide and Deep | Joint linear + deep neural network | Memorization plus generalization | Some manual feature engineering needed | Google Play app recommendations |
| DLRM | Embedding tables + explicit feature interaction + MLPs | Handles heterogeneous features; massive scale | Memory-intensive; complex infrastructure | Meta feed and ad ranking |
| DeepFM | Factorization machine + deep network | Automatic feature interaction; no manual cross features | Higher compute cost than linear models | CTR prediction in advertising |
| Two-tower models | Separate user/item encoders with ANN retrieval | Very fast retrieval at billion-item scale | Cannot model cross-feature interactions | YouTube candidate generation, Google Search |
| SASRec / BERT4Rec | Self-attention over item sequences | Captures temporal dynamics; parallelizable | Requires sequential data; may not help for infrequent users | Session-based, e-commerce |
| Graph neural networks | Message passing on user-item bipartite graph | Models high-order connectivity; incorporates side information | Scalability on very large graphs; complex training | PinSage (Pinterest), social networks |
| LLM-enhanced | Large language model for encoding, ranking, or explanation | Rich semantics; zero-shot capability; natural explanations | High latency; hallucination risk | Conversational recs, catalog enrichment |
| Multi-armed bandits | Explore-exploit algorithms | Handles cold start; adapts online; principled exploration | Simple models may underperform; slower convergence | News recommendation, ad placement |
| Knowledge-based | Domain rules and constraints | No cold start; works with infrequent purchases | Requires domain knowledge engineering | Real estate, automotive, finance |