See also: Machine learning terms
introduction
A recommendation system (also called a recommender system) is a class of machine learning software that predicts the preference a user would give to an item, then surfaces a ranked list of items the user is likely to enjoy. Recommendation engines power product suggestions on Amazon, video feeds on YouTube and TikTok, music playlists on Spotify, social posts on Facebook and Instagram, and pins on Pinterest. A 2013 McKinsey report attributed roughly 35% of Amazon's revenue and 75% of Netflix viewer watch time to recommendations.
A recommender takes three core inputs: a set of users, a set of items, and a record of interactions between them (ratings, clicks, watches, purchases, dwell time, likes). The output is a personalised ranking of items for each user, usually computed as a scoring function followed by re-ranking. At industrial scale the pipeline is split into candidate generation, scoring, and re-ranking because exhaustively ranking every item for every user is infeasible when catalogues contain hundreds of millions of items.
This page surveys the dominant families of recommendation algorithms, the milestone systems that defined each era, the evaluation metrics that practitioners use, and the open source libraries that make these techniques accessible.
why recommendation systems matter
Catalogue size has exploded: Amazon lists hundreds of millions of products, Spotify hosts more than 100 million tracks, and YouTube receives more than 500 hours of video uploads per minute. Without algorithmic filtering, users cannot navigate this long tail. Attention is scarce, so platforms compete on relevance per second of user time. Recommenders also create a self reinforcing data flywheel: better recommendations drive engagement, which produces more interaction data, which trains better models.
classical approaches
collaborative filtering
Collaborative filtering (CF) makes recommendations using only the user item interaction matrix, with no information about the content of items or the demographics of users. The core idea is that users who agreed in the past tend to agree in the future. CF was popularised by the GroupLens research group at the University of Minnesota in their 1994 paper on Usenet news filtering, and remains the conceptual foundation of the field.
There are two flavours of memory based CF:
| Variant | How it works | Best for |
|---|---|---|
| User user CF | Finds users similar to the target user (via cosine similarity or Pearson correlation) and aggregates their ratings on candidate items | Small user bases, dense interaction matrices |
| Item item CF | Finds items similar to those the target user already liked, then recommends them | Catalogues where items change less often than users; dominant approach at Amazon since 2003 |
Amazon's 2003 paper Item to Item Collaborative Filtering by Linden, Smith, and York pushed item item CF into industrial use. Item similarities can be precomputed offline, so serving cost stays low even as the user base grows.
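The item item approach can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Amazon's production method: the toy `ratings` dictionary, the function names, and the choice of raw cosine similarity over co-rated users are all hypothetical. The similarity table is built once (the offline step), and serving only sums over the handful of items the user has already rated.

```python
from math import sqrt

# Toy ratings: user -> {item: rating}. All names and values are illustrative.
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5},
    "cat": {"B": 1, "C": 5, "D": 4},
    "dan": {"A": 5, "D": 1},
}

def item_vectors(ratings):
    """Invert user -> item ratings into item -> {user: rating} columns."""
    cols = {}
    for user, items in ratings.items():
        for item, r in items.items():
            cols.setdefault(item, {})[user] = r
    return cols

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[u] * b[u] for u in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def item_similarities(ratings):
    """Precompute the item-item similarity table (the offline step)."""
    cols = item_vectors(ratings)
    return {i: {j: cosine(cols[i], cols[j]) for j in cols if j != i} for i in cols}

def recommend(user, ratings, sims, n=2):
    """Score unseen items by similarity-weighted ratings of the user's own items."""
    seen = ratings[user]
    scores = {}
    for item, r in seen.items():
        for other, s in sims[item].items():
            if other not in seen:
                scores[other] = scores.get(other, 0.0) + s * r
    return sorted(scores, key=scores.get, reverse=True)[:n]

sims = item_similarities(ratings)
print(recommend("bob", ratings, sims))  # -> ['C', 'D']
```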
content based filtering
Content based filtering recommends items whose features resemble items the user already liked. A film recommender might encode each movie as a vector of genres, actors, and keywords, then compute cosine similarity to the user's profile. Pandora's Music Genome Project, launched in 2000, is the textbook example: human analysts tag each song with up to 450 musical attributes, and the recommender finds songs whose attribute vectors lie close to those the listener has thumbed up. Content based methods avoid the pure cold start problem for new items but cannot exploit the wisdom of the crowd and tend to produce homogeneous suggestions.
hybrid recommenders
Hybrid systems combine collaborative and content signals. Common blending strategies include weighted hybrids (linear combination of scores), switching hybrids (use content based when collaborative data is sparse), and feature combination (feed both signals into a single ranker). Robin Burke's 2002 survey Hybrid Recommender Systems enumerates seven classes.
matrix factorization
Matrix factorization (MF) was the dominant family of recommendation algorithms from the mid 2000s through the mid 2010s. It decomposes the sparse user item interaction matrix R of shape (m users by n items) into two low rank matrices: a user matrix U of shape (m, k) and an item matrix V of shape (n, k), where k is the embedding dimension, typically 16 to 200. The predicted rating of user u for item i is the dot product of their latent vectors.
funk svd and the netflix prize
In October 2006 Netflix launched the Netflix Prize, a one million dollar competition for any team that could beat the RMSE of the company's Cinematch rating predictor by 10%. Simon Funk (pseudonym of Brandyn Webb) published a blog post in December 2006 describing an SGD variant of SVD that trained only on observed ratings, so missing entries never had to be imputed. His method, retroactively named Funk SVD, became the seed for almost every leading entry. The loss was squared prediction error plus L2 regularisation on the latent factors.
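A minimal sketch of this kind of SGD factorisation is shown here, assuming toy data and hyperparameters (`k`, `lr`, `reg`, `epochs`) chosen only for illustration. It updates all latent dimensions at once and omits bias terms, which is a simplification rather than a reproduction of Funk's original procedure.

```python
import random

def funk_svd(ratings, n_users, n_items, k=2, lr=0.02, reg=0.02, epochs=500, seed=0):
    """Fit user factors U and item factors V by SGD on observed (u, i, r) triples."""
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(a * b for a, b in zip(U[u], V[i]))
            for f in range(k):
                uf, vf = U[u][f], V[i][f]
                # Gradient step on squared error plus L2 penalty.
                U[u][f] += lr * (err * vf - reg * uf)
                V[i][f] += lr * (err * uf - reg * vf)
    return U, V

def rmse(ratings, U, V):
    """Root mean squared error over the given triples."""
    se = sum((r - sum(a * b for a, b in zip(U[u], V[i]))) ** 2 for u, i, r in ratings)
    return (se / len(ratings)) ** 0.5

# Hypothetical (user, item, rating) triples; unobserved pairs are simply absent.
ratings = [(0, 0, 5), (0, 1, 3), (0, 3, 1), (1, 0, 4), (1, 3, 1), (2, 0, 1),
           (2, 1, 1), (2, 3, 5), (3, 1, 1), (3, 2, 5), (3, 3, 4)]

U0, V0 = funk_svd(ratings, 4, 4, epochs=0)   # untrained factors for comparison
U, V = funk_svd(ratings, 4, 4)
print(rmse(ratings, U0, V0), rmse(ratings, U, V))
```

The loop touches only observed triples, so the cost per epoch grows with the number of ratings rather than with the full m by n matrix.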
als and weighted als
Alternating Least Squares (ALS) fixes one factor matrix and solves the other in closed form, then alternates. Each subproblem is a regularised least squares regression and parallelises trivially across users or items, which made ALS the algorithm of choice on Hadoop and Spark clusters. The weighted alternating least squares (WALS) variant adds a confidence weight to each observation, useful when the dataset combines explicit ratings with weaker implicit signals.
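As a worked form of the closed form step, with V held fixed each user's vector is the solution of a small ridge regression over the items that user rated. The notation here is an assumption for illustration: I_u is the set of items rated by user u, V_{I_u} stacks the corresponding rows of V, r_u collects the observed ratings, and λ is the regularisation weight.

```latex
\mathbf{u}_u = \left( V_{I_u}^{\top} V_{I_u} + \lambda I_k \right)^{-1} V_{I_u}^{\top} \mathbf{r}_u
\qquad\text{and symmetrically}\qquad
\mathbf{v}_i = \left( U_{J_i}^{\top} U_{J_i} + \lambda I_k \right)^{-1} U_{J_i}^{\top} \mathbf{r}_i
```

Here J_i is the set of users who rated item i. Each solve involves only a k by k system and is independent of every other user or item, which is why the updates distribute so easily.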
svd plus plus
SVD++, proposed by Yehuda Koren in 2008, augments standard SVD with implicit feedback by adding a term that sums embeddings of all items the user has interacted with, regardless of rating. SVD++ was a building block of BellKor's Pragmatic Chaos, which won the Netflix Prize in September 2009 with an RMSE of 0.8567, a 10.06% improvement over Cinematch.
bayesian probabilistic matrix factorization
Bayesian Probabilistic Matrix Factorization (BPMF), introduced by Salakhutdinov and Mnih in 2008, treats user and item factors as random variables with Gaussian priors and uses MCMC (Gibbs sampling) for inference. BPMF is robust to overfitting and gives calibrated uncertainty estimates at higher compute cost.
the netflix prize blend
The winning entry was a linear combination of more than 800 predictors including SVD++, restricted Boltzmann machines, asymmetric SVD, neighbourhood blends, and time aware models. Ensembles of diverse complementary models almost always outperform any single algorithm.
implicit feedback
Most real world interaction data is implicit: clicks, plays, dwell time, purchases, scroll depth. Implicit signals are abundant but noisy and one sided, since users do not produce explicit dislikes for items they ignore. The seminal paper Collaborative Filtering for Implicit Feedback Datasets by Hu, Koren, and Volinsky (IEEE ICDM 2008) reframed the problem as predicting binary preference (interacted or not) weighted by a confidence c that grows with the count of interactions. The objective is solved with a weighted ALS step that scales linearly in observations. The Hu Koren Volinsky formulation underpins the open source Implicit library and remains a strong baseline.
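The formulation can be written out as follows, using the paper's symbols rather than the U and V of the earlier sections: r_{ui} is the raw interaction count, p_{ui} the binary preference, c_{ui} the confidence, x_u and y_i the user and item factors, and α and λ are tunable hyperparameters.

```latex
p_{ui} = \begin{cases} 1 & r_{ui} > 0 \\ 0 & r_{ui} = 0 \end{cases}
\qquad
c_{ui} = 1 + \alpha\, r_{ui}
\qquad
\min_{x,\,y} \sum_{u,i} c_{ui} \left( p_{ui} - x_u^{\top} y_i \right)^2
 + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)
```

The sum runs over every user item pair, including unobserved ones, which enter as zero preference with the minimum confidence of 1.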
neural and deep learning approaches
autorec
AutoRec (Sedhain et al., WWW 2015) was one of the first neural recommenders to gain traction. It treats CF as an autoencoder problem: the input is a partially observed row or column of the rating matrix, and the network learns to reconstruct it. AutoRec outperformed plain MF on MovieLens benchmarks.
neural collaborative filtering
Neural Collaborative Filtering (NCF) by He et al. (WWW 2017) replaced the inner product in MF with an MLP applied to the concatenation of user and item embeddings. The paper introduced the NeuMF architecture, a fusion of generalised MF and an MLP tower. NCF was hugely influential, although a 2020 reproducibility study by Rendle et al. argued that well tuned MF baselines match NCF on standard datasets.
wide and deep
Google's Wide and Deep Learning for Recommender Systems (Cheng et al., 2016) was deployed in the Google Play Store. The wide arm is a linear model over crossed categorical features that excels at memorisation; the deep arm is an MLP over dense embeddings that excels at generalisation. The two arms are jointly trained. Wide and Deep set the template for hybrid architectures and inspired DCN, DeepFM, xDeepFM, and DIN.
factorization machines and deepfm
Factorization Machines (Rendle, 2010) generalise MF to arbitrary feature interactions. DeepFM (Guo et al., 2017) plugs an FM layer in alongside a deep network so that low order and high order interactions are modelled jointly, without manual feature engineering.
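The second order FM prediction can be sketched as below. The feature vector, weights, and factor matrix are made up for illustration. The first function uses the reformulation from Rendle's paper that computes all pairwise interactions in time linear in the number of features, and the second is the naive double loop kept only to check that the two agree.

```python
def fm_predict(x, w0, w, V):
    """Second-order FM score in O(n * k) time.

    x: feature values, w0: global bias, w: linear weights,
    V: n x k matrix whose row i is the latent vector of feature i.
    """
    n, k = len(x), len(V[0])
    linear = w0 + sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return linear + pairwise

def fm_predict_naive(x, w0, w, V):
    """Same score via the explicit sum over feature pairs, O(n^2 * k)."""
    n = len(x)
    linear = w0 + sum(w[i] * x[i] for i in range(n))
    pairwise = sum(
        sum(a * b for a, b in zip(V[i], V[j])) * x[i] * x[j]
        for i in range(n) for j in range(i + 1, n)
    )
    return linear + pairwise

# Hypothetical example: four features, two latent dimensions.
x = [1.0, 0.0, 1.0, 0.5]
w0 = 0.1
w = [0.2, -0.1, 0.3, 0.05]
V = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4], [0.5, 0.1]]
print(fm_predict(x, w0, w, V), fm_predict_naive(x, w0, w, V))
```

With one hot user and item indicators as the only features, the pairwise term reduces to the dot product of a user vector and an item vector, which is how FM recovers plain MF.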
sequential and session based recommendation
Order matters. The next song you play depends on the last three you played, not on a static profile. Sequential recommenders model user behaviour as a sequence and predict the next item.
| Model | Year | Architecture | Notes |
|---|---|---|---|
| GRU4Rec | 2016 | Gated recurrent unit RNN | Hidasi et al.; first big neural session model |
| Caser | 2018 | Convolutional filters over recent items | Captures union level and point level patterns |
| SASRec | 2018 | Self attention only | Kang and McAuley; precursor to BERT4Rec |
| BERT4Rec | 2019 | Bidirectional Transformer, BERT style masking | Sun et al., Alibaba |
| S3Rec | 2020 | Self supervised pretraining | Mutual information maximisation across attributes |
| TiSASRec | 2020 | Time interval aware self attention | Adds temporal gaps between events |
| gSASRec | 2023 | Generalised SASRec with negative sampling fix | Beats BERT4Rec on several benchmarks |
Sequential models are now the workhorses of short video and music feeds.
two tower and dual encoder models
At very large scale (think YouTube, with billions of users and billions of videos) you cannot afford to score every (user, item) pair with a deep network. The two tower or dual encoder architecture trains a user tower and an item tower that each output a fixed dimensional embedding, then defines the score as a dot product or cosine similarity. Item embeddings can be indexed in an approximate nearest neighbour structure such as ScaNN or Faiss, and retrieval becomes sub linear in catalogue size.
The canonical reference is Deep Neural Networks for YouTube Recommendations by Covington, Adams, and Sargin (RecSys 2016). Their two stage architecture (a candidate generation network plus a ranking network) is the blueprint that most large platforms still follow. Sampling negatives correctly is the central engineering challenge: in batch negatives, mixed negative sampling, and sampled softmax with logQ correction are all standard tricks. Google's 2019 paper Sampling Bias Corrected Neural Modeling for Large Corpus Item Recommendations derived a streaming logQ correction that lets the system train on fresh logs.
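A rough sketch of in batch softmax with logQ correction follows, assuming that the towers have already produced embeddings and that each item's sampling probability is known. In practice that probability has to be estimated; the vectors and probabilities here are invented for illustration. Each row's own item is the positive, every other item in the batch serves as a negative, and subtracting log q from each logit compensates for popular items appearing as negatives more often.

```python
from math import exp, log

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def in_batch_softmax_loss(user_vecs, item_vecs, item_probs):
    """Mean cross-entropy over a batch with logQ-corrected logits.

    Row i pairs user i with its positive item i; the other items in the
    batch act as negatives. item_probs[j] is the assumed probability that
    item j is sampled into a batch.
    """
    n = len(user_vecs)
    total = 0.0
    for i in range(n):
        logits = [dot(user_vecs[i], item_vecs[j]) - log(item_probs[j]) for j in range(n)]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + log(sum(exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / n

# Hypothetical tower outputs for a batch of three (user, item) pairs.
user_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
item_vecs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.6]]
item_probs = [0.5, 0.3, 0.2]  # item 0 is the most popular
print(in_batch_softmax_loss(user_vecs, item_vecs, item_probs))
```

When every item has the same sampling probability the correction shifts all logits by a constant and leaves the loss unchanged, so it only matters when popularity is skewed.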
graph based recommendation
User item interaction data forms a bipartite graph, and richer graphs include item item co occurrence, social ties, and knowledge graph relations. Graph neural networks (GNNs) propagate embeddings along edges so that an item's representation reflects its multi hop neighbourhood.
pinsage
PinSage (Ying et al., KDD 2018) was the first graph convolutional recommender deployed at web scale. Built at Pinterest on three billion pins and 18 billion edges, PinSage uses random walk based neighbour sampling and importance pooling, sidestepping the memory cost of full graph convolutions. The paper reports a 150% improvement in offline hit rate over the best performing baseline.
lightgcn
LightGCN (He et al., SIGIR 2020) stripped GNNs down to their essentials by removing feature transformation and non linear activation. The remaining operation is iterative neighbourhood aggregation followed by averaging across layers. LightGCN beat NGCF on all three standard benchmarks and became the strongest GNN baseline.
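The propagation rule can be sketched without any deep learning framework. This sketch assumes symmetric degree normalisation and a uniform average over layers 0 to K, and it covers only the forward pass: training the base embeddings is left out, and the toy graph and vectors are hypothetical.

```python
from math import sqrt

def lightgcn_embeddings(edges, user_emb, item_emb, n_layers=2):
    """Propagate embeddings over a bipartite graph and average across layers."""
    user_nbrs = {u: [] for u in user_emb}
    item_nbrs = {i: [] for i in item_emb}
    for u, i in edges:
        user_nbrs[u].append(i)
        item_nbrs[i].append(u)
    dim = len(next(iter(user_emb.values())))

    def propagate(nbrs, other_nbrs, other_emb):
        """One layer: degree-normalised sum of neighbour embeddings, no weights or activation."""
        out = {}
        for node, neighbours in nbrs.items():
            vec = [0.0] * dim
            for nb in neighbours:
                weight = 1.0 / sqrt(len(neighbours) * len(other_nbrs[nb]))
                for d in range(dim):
                    vec[d] += weight * other_emb[nb][d]
            out[node] = vec
        return out

    layer_u, layer_i = user_emb, item_emb
    total_u = {u: list(e) for u, e in user_emb.items()}
    total_i = {i: list(e) for i, e in item_emb.items()}
    for _ in range(n_layers):
        # Both sides are updated from the previous layer's embeddings.
        layer_u, layer_i = (propagate(user_nbrs, item_nbrs, layer_i),
                            propagate(item_nbrs, user_nbrs, layer_u))
        for u in total_u:
            total_u[u] = [a + b for a, b in zip(total_u[u], layer_u[u])]
        for i in total_i:
            total_i[i] = [a + b for a, b in zip(total_i[i], layer_i[i])]
    scale = 1.0 / (n_layers + 1)
    return ({u: [scale * x for x in v] for u, v in total_u.items()},
            {i: [scale * x for x in v] for i, v in total_i.items()})

# Hypothetical graph: two users, three items, four interactions.
edges = [("u0", "i0"), ("u0", "i1"), ("u1", "i1"), ("u1", "i2")]
users = {"u0": [1.0, 0.0], "u1": [0.0, 1.0]}
items = {"i0": [0.5, 0.5], "i1": [0.2, 0.8], "i2": [0.9, 0.1]}
final_users, final_items = lightgcn_embeddings(edges, users, items)
print(final_users["u0"])
```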
knowledge graph aware recommenders
KGAT, RippleNet, and KGCN integrate external knowledge graphs (DBpedia, Wikidata) so that recommendations exploit factual relations like (movie, directed by, director). These methods help with cold start because new items inherit signal through their knowledge graph neighbours.
multi stage retrieval and ranking
Industrial recommenders are pipelines, not monolithic models. A typical stack has four stages:
| Stage | Latency budget | Catalog size at stage | Typical model |
|---|---|---|---|
| Candidate generation | 10 to 50 ms | Billions to thousands | Two tower ANN, item to item, popularity |
| First stage ranking | 20 to 100 ms | Thousands to hundreds | Lightweight DNN, GBDT |
| Second stage ranking | 50 to 200 ms | Hundreds to tens | Heavy DNN, transformer, multi task |
| Re-ranking | 5 to 50 ms | Tens to top N | Diversity rules, business logic, fairness |
Each stage trades model expressiveness for latency. The candidate generator must be cheap but high recall, the ranker must be accurate but can afford more compute, and the re-ranker enforces business constraints.
cold start
The cold start problem arises when the system lacks interaction data for a new user, a new item, or an entire deployment. Standard mitigations:
- Content features. Use side information about the new item (text, image, metadata) so the model can place it in embedding space without interaction history.
- Demographic features. For new users, use age, location, device, or onboarding survey answers as a fallback profile.
- Transfer learning. Pretrain on a large auxiliary corpus and fine tune on the target domain. Foundation model embeddings (CLIP for images, sentence transformers for text) are popular item encoders.
- Exploration. Insert randomised or epsilon greedy items so the system gathers data on new entries. Multi armed bandit approaches such as LinUCB and Thompson sampling formalise the explore exploit tradeoff.
- Meta learning. MAML style few shot learners adapt quickly to new users.
- Prompt based generation. LLM based recommenders zero shot rank new items from textual descriptions.
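The exploration mitigation above can be illustrated with Thompson sampling over click through rates. This is a toy simulation under loud assumptions: the item names and true click rates are invented, each item gets an independent Beta(1, 1) prior, and there is no user context, unlike LinUCB.

```python
import random

class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over a fixed set of items."""

    def __init__(self, items, seed=0):
        self.rng = random.Random(seed)
        # [alpha, beta] per item, starting from a uniform Beta(1, 1) prior.
        self.posterior = {item: [1, 1] for item in items}

    def select(self):
        """Draw one plausible CTR per item and show the item with the highest draw."""
        draws = {item: self.rng.betavariate(a, b) for item, (a, b) in self.posterior.items()}
        return max(draws, key=draws.get)

    def update(self, item, clicked):
        """Add a success to alpha or a failure to beta."""
        self.posterior[item][0 if clicked else 1] += 1

def simulate(true_ctr, rounds=3000, seed=1):
    """Run the sampler against simulated users and count impressions per item."""
    env = random.Random(seed)
    sampler = ThompsonSampler(true_ctr, seed=seed)
    shown = {item: 0 for item in true_ctr}
    for _ in range(rounds):
        item = sampler.select()
        shown[item] += 1
        sampler.update(item, env.random() < true_ctr[item])
    return shown

# Hypothetical catalogue: one established item and two new ones.
shown = simulate({"established": 0.05, "new_a": 0.40, "new_b": 0.02})
print(shown)
```

New items start with wide posteriors, so they are shown often at first. Impressions then concentrate on whichever item the evidence favours.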
industrial recommendation systems
netflix
Beyond the original ratings recommender, Netflix personalises artwork (which thumbnail to show), row ordering on the home page, search results, and autoplay trailers. The 2016 paper The Netflix Recommender System by Gomez Uribe and Hunt is required reading. The company runs hundreds of A/B tests per quarter and credits personalisation with reducing monthly churn by an estimated one percentage point.
youtube
YouTube's recommender drives more than 70% of watch time according to public statements by company executives. The architecture (Covington 2016) is two stage: candidate generation via deep neural networks, then a ranking model combining hundreds of features including video age, click context, and predicted watch time. Recent developments include reinforcement learning for long term satisfaction (REINFORCE recommender, 2019) and large scale transformer rankers.
spotify
Spotify's Discover Weekly playlist, launched in July 2015, generates a personalised 30 song mix every Monday for over 600 million users. The system blends three signals: collaborative filtering over playlist co occurrence, NLP analysis of music blogs and reviews, and audio analysis using convolutional networks on raw spectrograms. Spotify also operates Daily Mixes, Release Radar, and DJ AI (2023, an LLM voiced personalised radio).
pinterest
Pinterest pioneered PinSage, the first web scale graph convolutional recommender, and runs related pins, search, and home feed off of Pin and Board embeddings. The Homefeed Personalization talks describe a multi objective ranker that balances engagement, diversity, freshness, and advertiser content.
tiktok
TikTok's For You feed is credited with raising the bar for engagement. Public statements describe a ranking model that scores candidates by predicted likes, comments, completion rate, and watch time, with strong emphasis on negative feedback signals (skips, not interested taps), and surfaces niche creators quickly through aggressive exploration.
other platforms
Amazon's item to item collaborative filter (Linden et al., 2003) has evolved into a sophisticated pipeline that drives product recommendations across the storefront. Alibaba's Deep Interest Network (DIN, 2018) and Deep Interest Evolution Network (DIEN, 2019) introduced attention mechanisms over a user's historical behaviour, weighting past items by their relevance to the candidate. Taobao runs TDM (Tree based Deep Models) for billion scale retrieval. Meta's feed and Reels recommenders use multi task DNN rankers; the company open sourced DLRM (2019) and the TorchRec library.
llm era recommendation
Large language models have begun to reshape recommendation in three ways.
First, LLMs serve as item encoders. The text describing an item (title, tags, reviews) is fed to a pretrained model such as Llama or a sentence transformer, and the resulting embedding is consumed by a downstream ranker. This sidesteps cold start because new items inherit semantic meaning from text alone.
Second, LLMs power conversational recommenders. Spotify's DJ, Amazon's Rufus (2024), and Meta AI inside Instagram all let users describe what they want in natural language.
Third, recent research treats recommendation as language modelling. Approaches such as P5 (Geng et al., RecSys 2022), LLaRA, RecLLM, GenRec, and TALLRec serialise user histories into token sequences and fine tune an LLM to predict the next item identifier. Generative retrieval with Semantic IDs (Rajput et al., NeurIPS 2023) replaces hashed item IDs with hierarchically clustered semantic codes the model can decode token by token. Apple's MLX framework (released December 2023) has been used to fine tune small recommender LLMs that run on device.
Whether LLM recommenders will displace classical two tower retrieval at scale is unsettled, since latency and serving cost remain orders of magnitude higher. Most production deployments in 2025 use LLMs for offline candidate enrichment and explanation rather than online ranking.
evaluation metrics
Recommender quality is measured along several axes. Offline metrics use a held out test set; online metrics come from A/B tests in production.
| Metric | Family | What it captures | Range |
|---|---|---|---|
| RMSE | Rating prediction | Root mean squared error | 0 to unbounded |
| MAE | Rating prediction | Mean absolute error | 0 to unbounded |
| Precision@k | Top N | Fraction of top k items that are relevant | 0 to 1 |
| Recall@k | Top N | Fraction of relevant items captured in top k | 0 to 1 |
| Hit rate@k | Top N | Fraction of users with at least one relevant item in top k | 0 to 1 |
| MAP | Top N, ranked | Mean average precision | 0 to 1 |
| MRR | Top N, ranked | Mean reciprocal rank of first relevant item | 0 to 1 |
| NDCG@k | Top N, ranked | Normalised discounted cumulative gain | 0 to 1 |
| AUC | Pairwise | Probability a positive scores higher than a negative | 0 to 1 (0.5 is chance) |
| Coverage | Catalogue | Fraction of items the system ever recommends | 0 to 1 |
| Diversity | List | Average dissimilarity among recommended items | 0 to 1 |
| Novelty | List | Inverse popularity of recommended items | 0 to log N |
| Serendipity | List | Unexpected and useful recommendations | informal |
NDCG is the most reported ranking metric in academic papers because it handles graded relevance and position bias gracefully. A/B tested click through rate, watch time, and retention drive production decisions, since offline metrics correlate imperfectly with online behaviour.
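Several of the top N metrics in the table can be computed in a few lines. This sketch assumes one user's ranked list and a hypothetical dictionary of graded relevance labels, and it uses linear gain in DCG; some papers use 2^rel - 1 instead, which changes the numbers but not the idea.

```python
from math import log2

def precision_at_k(ranked, relevant, k):
    """Fraction of the top k items that are relevant."""
    return sum(1 for x in ranked[:k] if x in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for x in ranked[:k] if x in relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant item, or 0 if none appears."""
    for pos, x in enumerate(ranked, start=1):
        if x in relevant:
            return 1.0 / pos
    return 0.0

def ndcg_at_k(ranked, gains, k):
    """DCG of the ranking divided by the DCG of the ideal ordering (linear gain)."""
    dcg = sum(gains.get(x, 0) / log2(pos + 1) for pos, x in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / log2(pos + 1) for pos, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

# Hypothetical example: graded relevance for one user; "e" is relevant but never recommended.
ranked = ["a", "b", "c", "d"]
gains = {"b": 3, "d": 1, "e": 2}
relevant = set(gains)

print(precision_at_k(ranked, relevant, 3))
print(recall_at_k(ranked, relevant, 3))
print(reciprocal_rank(ranked, relevant))
print(ndcg_at_k(ranked, gains, 4))
```

MRR and the reported Precision@k, Recall@k, and NDCG@k are the means of these per user values over all test users.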
diversity, serendipity, and fairness
Maximising predicted relevance can produce filter bubbles and echo chambers, and it can amplify popularity bias. Counterweights include:
- Maximal marginal relevance (Carbonell and Goldstein, 1998) re-ranks results to balance relevance against novelty.
- Determinantal point processes (DPPs) sample diverse subsets by maximising the determinant of a kernel matrix (Kulesza and Taskar, 2012).
- Calibrated recommendations (Steck, RecSys 2018) ensure the topic distribution of recommendations matches the user's historical distribution.
- Fairness aware ranking constrains exposure across protected groups; works include FA*IR (Zehlike et al., 2017) and Equity of Attention (Biega et al., 2018).
- Provider side fairness ensures that creators or sellers receive a fair share of impressions, an active research direction at Spotify, Etsy, and Airbnb.
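Of the counterweights listed above, maximal marginal relevance is the simplest to sketch. The relevance scores, genre labels, and the same genre similarity function below are hypothetical; `lam` trades relevance against redundancy, and `lam=1.0` falls back to a pure relevance ranking.

```python
def mmr(candidates, relevance, similarity, k, lam=0.7):
    """Greedy MMR: repeatedly pick the item with the best relevance/redundancy trade-off."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(c):
            # Redundancy is the similarity to the closest already-selected item.
            max_sim = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * max_sim
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected

# Hypothetical candidates with predicted relevance and a genre label.
relevance = {"thriller_1": 0.90, "thriller_2": 0.85, "comedy_1": 0.60}
genre = {"thriller_1": "thriller", "thriller_2": "thriller", "comedy_1": "comedy"}

def same_genre(a, b):
    """Toy similarity: 1.0 for items sharing a genre, else 0.0."""
    return 1.0 if genre[a] == genre[b] else 0.0

print(mmr(relevance, relevance, same_genre, k=2))           # -> ['thriller_1', 'comedy_1']
print(mmr(relevance, relevance, same_genre, k=2, lam=1.0))  # -> ['thriller_1', 'thriller_2']
```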
The European Union's Digital Services Act (effective 2024) requires very large online platforms to offer at least one recommender option that is not based on profiling.
open source libraries
| Library | Maintainer | Strengths |
|---|---|---|
| TensorFlow Recommenders (TFRS) | Google | Two tower retrieval and ranking, deployed via TF Serving |
| Microsoft Recommenders | Microsoft | Reference implementations of 25+ algorithms with benchmarks |
| TorchRec | Meta | Distributed embeddings for billion scale models in PyTorch |
| Implicit | Ben Frederickson | Fast ALS, BPR, and LMF for implicit data |
| LightFM | Maciej Kula | Hybrid of MF and content features in Cython |
| Spotlight | Maciej Kula | PyTorch sequential and factorisation models |
| Surprise | Nicolas Hug | Classic CF algorithms, scikit learn style API |
| RecBole | Renmin University | 90+ models, unified benchmark for academic research |
| Cornac | Preferred.AI (Singapore Management University) | Multimodal recommenders, focus on reproducibility |
| Faiss | Meta | Approximate nearest neighbour search for retrieval |
| ScaNN | Google | Quantisation based ANN, optimised for two tower serving |
| NVIDIA Merlin | NVIDIA | GPU accelerated ETL, training, and inference for recsys |
a brief timeline
| Year | Milestone |
|---|---|
| 1992 | Tapestry (Goldberg et al., Xerox PARC) coins collaborative filtering |
| 1994 | GroupLens at Minnesota applies CF to Usenet news |
| 2000 | Pandora's Music Genome Project starts |
| 2003 | Amazon publishes item to item CF |
| 2006 | Netflix Prize launches; Funk SVD blog post |
| 2008 | Hu Koren Volinsky on implicit feedback; SVD++ |
| 2009 | BellKor's Pragmatic Chaos wins Netflix Prize |
| 2010 | Steffen Rendle introduces Factorization Machines |
| 2015 | Spotify launches Discover Weekly; AutoRec |
| 2016 | YouTube DNN paper; Wide and Deep; GRU4Rec |
| 2017 | Neural Collaborative Filtering; DeepFM |
| 2018 | PinSage; Alibaba DIN; SASRec |
| 2019 | BERT4Rec; Meta DLRM open sourced |
| 2020 | LightGCN; Rendle reproducibility study questions NCF |
| 2022 | P5: recommendation as language modelling |
| 2023 | Generative retrieval with semantic IDs; Spotify DJ AI |
| 2024 | Apple MLX recsys experiments; Amazon Rufus; EU DSA non profiling rules |
references
- Goldberg, D., Nichols, D., Oki, B. M., and Terry, D. (1992). Using collaborative filtering to weave an information tapestry. CACM 35(12).
- Resnick, P. et al. (1994). GroupLens: An open architecture for collaborative filtering of netnews. CSCW.
- Linden, G., Smith, B., and York, J. (2003). Amazon.com recommendations: item to item collaborative filtering. IEEE Internet Computing.
- Funk, S. (2006). Netflix Update: Try This at Home. sifter.org.
- Koren, Y. (2008). Factorization meets the neighborhood. KDD.
- Hu, Y., Koren, Y., and Volinsky, C. (2008). Collaborative filtering for implicit feedback datasets. IEEE ICDM.
- Salakhutdinov, R., and Mnih, A. (2008). Bayesian probabilistic matrix factorization using MCMC. ICML.
- Rendle, S. (2010). Factorization Machines. IEEE ICDM.
- Sedhain, S. et al. (2015). AutoRec: Autoencoders meet collaborative filtering. WWW.
- Cheng, H. T. et al. (2016). Wide and Deep Learning for Recommender Systems. DLRS, RecSys.
- Covington, P., Adams, J., and Sargin, E. (2016). Deep Neural Networks for YouTube Recommendations. RecSys.
- Gomez Uribe, C. A., and Hunt, N. (2016). The Netflix recommender system. ACM TMIS 6(4).
- Hidasi, B. et al. (2016). Session based recommendations with recurrent neural networks. ICLR.
- He, X. et al. (2017). Neural Collaborative Filtering. WWW.
- Guo, H. et al. (2017). DeepFM. IJCAI.
- Kang, W. C., and McAuley, J. (2018). Self attentive sequential recommendation. IEEE ICDM.
- Ying, R. et al. (2018). Graph convolutional neural networks for web scale recommender systems (PinSage). KDD.
- Zhou, G. et al. (2018). Deep Interest Network for click through rate prediction. KDD.
- Sun, F. et al. (2019). BERT4Rec. CIKM.
- Naumov, M. et al. (2019). DLRM. arXiv:1906.00091.
- He, X. et al. (2020). LightGCN. SIGIR.
- Rendle, S. et al. (2020). Neural collaborative filtering vs. matrix factorization revisited. RecSys.
- Geng, S. et al. (2022). P5: Recommendation as language processing. RecSys.
- Rajput, S. et al. (2023). Recommender systems with generative retrieval. NeurIPS.