Recommender Systems
Last reviewed: May 4, 2026
Sources: 21 citations
Review status: Source-backed
Revision: v1 · 4,496 words
Recommender systems are a class of information filtering systems that predict the preference or rating that a user would give to an item, and use those predictions to surface items the user is most likely to want. They power product suggestions on e-commerce sites, video and music feeds on streaming platforms, content ranking in social networks, and personalized advertising across the open web.
From a research perspective, recommender systems sit at the intersection of machine learning, information retrieval, statistics, and human-computer interaction. They take in sparse signals (clicks, ratings, dwell times, purchases, follows) and dense side information (item metadata, user profiles, context) and produce an ordered list of items chosen from a catalog that is often orders of magnitude larger than what a user could ever browse manually.
In industry, recommendation has become one of the most economically important applications of artificial intelligence. The same architectural patterns appear across companies that look outwardly very different: a candidate generator narrows billions of items down to a few hundred or a few thousand, a ranking model scores those candidates with a deep network, and a re-ranker enforces business rules such as diversity, freshness, and fairness. The plural form "recommender systems" emphasizes this systems-engineering view of multiple cooperating models, approximate nearest-neighbour (ANN) indexes, feature stores, and serving infrastructure, in contrast to the singular "recommender system", which is more often used for an individual algorithm or model.
A recommender system is any system that, given a representation of a user (which may be as little as a session of anonymous clicks) and a catalog of items, returns a ranked subset of items predicted to be of interest. Items can be products, articles, songs, videos, ads, friends to follow, jobs, or restaurants. Three formal tasks dominate the literature:
Recommender systems overlap heavily with neighbouring fields. They share evaluation metrics and ranking models with information retrieval, share latent variable models with NLP (the same low-rank embedding ideas that power word embeddings also underpin matrix factorization for ratings), and share auction mechanics and click prediction with computational advertising. Many large companies treat search, recommendation, and ads as variants of the same underlying retrieval-and-ranking stack with different optimization objectives.
The Grundy system, described in Elaine Rich's 1979 Cognitive Science article "User Modeling via Stereotypes" (drawn from her Carnegie Mellon PhD work; Rich later joined the University of Texas at Austin), recommended novels to readers. Grundy asked the user a small set of self-description questions, mapped the answers to stereotypes representing groups of readers, and used those stereotypes to predict which books the user would enjoy. Although it did not use ratings from other users, Grundy introduced the idea of a learned user model that drives item suggestions, an idea central to all later recommenders.
The term collaborative filtering was coined in the 1992 paper "Using Collaborative Filtering to Weave an Information Tapestry" by David Goldberg, David Nichols, Brian Oki, and Douglas Terry, working at Xerox PARC. The Tapestry system let users at PARC annotate documents (mostly emails and news articles) and write filter queries that referenced other users' annotations. It was a manual, query-based form of collaborative filtering rather than the automatic version we use today, but it gave the field its name.
In 1994, Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl published "GroupLens: An Open Architecture for Collaborative Filtering of Netnews" at the ACM Conference on Computer Supported Cooperative Work. GroupLens, developed at the University of Minnesota, is widely cited as the first automated collaborative filter: it computed Pearson-correlation neighborhoods over Usenet ratings and predicted scores for unread articles. The same group launched the MovieLens website in 1997 to gather research data on personalized movie recommendations, and went on to release the standard benchmark datasets (100K through 32M ratings) that still dominate offline evaluation in academic papers.
Amazon engineers Greg Linden, Brent Smith, and Jeremy York developed an item-to-item collaborative filter for Amazon in the late 1990s and described it in their 2003 IEEE Internet Computing article "Amazon.com Recommendations: Item-to-Item Collaborative Filtering." Instead of finding similar users (which scaled poorly), they precomputed a similar-items table by measuring how often pairs of items were purchased together relative to a baseline. In 2017, IEEE Internet Computing named the article the journal's "test of time" paper.
On 2 October 2006, Netflix launched the Netflix Prize, a US$1,000,000 competition to improve the root mean square error (RMSE) of its Cinematch movie-rating predictor by at least 10 percent on a held-out subset of a 100-million-rating dataset. The competition ran for nearly three years; the winning team, BellKor's Pragmatic Chaos (a merger of researchers from AT&T Labs, Pragmatic Theory, and BigChaos), achieved a 10.06 percent test-RMSE improvement and was awarded the prize on 21 September 2009. The Prize popularized matrix factorization, ensembling, and SGD solvers for very large, very sparse rating matrices. Netflix has stated publicly that it never put the full prize-winning ensemble into production because of its complexity.
In August 2009, Yehuda Koren, Robert Bell, and Chris Volinsky published "Matrix Factorization Techniques for Recommender Systems" in IEEE Computer (vol. 42, no. 8). The article distilled techniques the authors had used during the Netflix Prize and became the canonical reference for SVD-style latent factor models. In December 2010, Steffen Rendle published "Factorization Machines" at IEEE ICDM, generalizing matrix factorization to handle arbitrary categorical and numerical features through pairwise factorized interactions in linear training time.
Deep learning entered recommender systems through several landmark industrial papers. Paul Covington, Jay Adams, and Emre Sargin from Google published "Deep Neural Networks for YouTube Recommendations" at RecSys 2016, introducing the now-standard two-stage architecture (deep candidate generator plus deep ranker) for video recommendation at YouTube scale. At the same conference, Heng-Tze Cheng and colleagues at Google published "Wide & Deep Learning for Recommender Systems" at the co-located 1st Workshop on Deep Learning for Recommender Systems (DLRS); the model jointly trained a wide linear part (to memorize sparse feature crosses) with a deep neural network part (to generalize via low-dimensional embeddings) and shipped in Google Play. Xiangnan He and colleagues introduced "Neural Collaborative Filtering" at WWW 2017, popularizing MLP-based replacements for the dot product in matrix factorization. Wang and colleagues at Google introduced the Deep & Cross Network at ADKDD 2017; the improved DCN-V2 followed at WWW 2021. SASRec (Kang and McAuley, ICDM 2018) used causal self-attention to predict the next item, and BERT4Rec (Sun et al., CIKM 2019) used a bidirectional masked-item objective; together they demonstrated that the transformer architecture works well for next-item prediction. In 2018, Pinterest engineers and Stanford researchers published the PinSage paper at KDD, describing a GraphSAGE-derived GCN trained on a 3-billion-node graph of Pins and boards at Pinterest.
At RecSys 2019, Maurizio Ferrari Dacrema, Paolo Cremonesi, and Dietmar Jannach published "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches," which won the conference's Best Long Paper award. The paper showed that several recently published neural recommenders, when re-implemented and properly tuned, were beaten by carefully tuned classical baselines such as item-kNN, matrix factorization, and pure popularity. The work prompted renewed attention to baseline tuning, dataset splits, and reproducible code.
In 2022, ByteDance published "Monolith: Real Time Recommendation System with Collisionless Embedding Table," describing the cuckoo-hash-based embedding store and online training architecture used in TikTok-style products. In 2023, Rajput and colleagues at Google published "Recommender Systems with Generative Retrieval" at NeurIPS, introducing the TIGER approach in which items are encoded as tuples of discrete semantic IDs (via RQ-VAE) and a transformer decoder generates the next item's ID autoregressively. From 2023 onward, large language models began to be used as zero-shot rankers, explanation generators, and data augmenters for recommendation.
The field has accumulated a large taxonomy of recommender designs. The table below summarizes the families that appear most often in production and in the research literature.
| Family | Core idea | Typical strengths | Typical weaknesses |
|---|---|---|---|
| Content-based filtering | Recommend items whose features resemble items the user previously liked | Works for new items; explainable | Needs good item features; poor diversity |
| User-based collaborative filtering | Find users similar to the target user; recommend what they liked | Captures taste through behaviour, no item features needed | Sparse user-user similarity; poor scalability |
| Item-based collaborative filtering | Precompute item-item similarity; recommend items similar to those the user liked | Stable similarities, scales well; Amazon's classic | Limited diversity; popular items dominate |
| Matrix factorization | Learn low-rank user and item embeddings whose dot product predicts rating or interaction | Strong baseline; compact; many extensions | Hard to add side features; cold start |
| Factorization machines | Generalize MF to arbitrary features via pairwise factorized interactions | Handles rich side features; linear time | Pairwise interactions only |
| Hybrid systems | Combine content, collaborative, and other signals | Robust; mitigates cold start | More moving parts |
| Knowledge-based | Reason over an explicit knowledge base or constraints | Useful for big-ticket items (cars, real estate) | Requires curated knowledge |
| Demographic | Recommend based on age, gender, region, etc. | Works under cold start | Coarse, raises fairness concerns |
| Context-aware | Conditions on time, location, device, or session context | Captures situational preferences | Adds dimensionality |
| Sequential / session-based | Models the order of recent interactions (RNN, transformer) | Captures short-term intent | Data hungry |
| Graph-based | Learns embeddings on a user-item or item-item graph (PinSage, LightGCN) | Captures higher-order structure | Sampling tricky |
| Reinforcement learning-based | Treats recommendation as sequential decision making with delayed reward | Optimizes long-term engagement | Hard to evaluate |
| Generative recommendation | Decoder LM predicts next item ID autoregressively (e.g., TIGER) | Unifies retrieval and ranking | New, less mature in production |
Within collaborative filtering, a further distinction is often made between memory-based approaches (which compute predictions directly from stored interaction data, typically via similarity neighborhoods) and model-based approaches (which train a parametric model whose parameters compress the interaction data). User-based and item-based kNN are memory-based; matrix factorization, factorization machines, and neural recommenders are model-based.
User-user and item-item kNN compute a similarity (cosine, Pearson correlation, adjusted cosine, Jaccard) between rows or columns of the user-item interaction matrix, then predict by a similarity-weighted average of neighbours' ratings. Item-item kNN scales better than user-user kNN because the item-item similarity matrix can be precomputed offline and updated less frequently than user behaviour changes.
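The neighbourhood prediction described above can be sketched in a few lines. This is a minimal illustration, assuming plain cosine similarity on raw ratings and a toy in-memory dictionary; the data, the function names (`item_vectors`, `cosine`, `predict`), and the choice of `k` are all hypothetical and not taken from any particular system.

```python
import math
from collections import defaultdict

# Toy explicit ratings: user -> {item: rating}. All values are made up.
ratings = {
    "u1": {"A": 5.0, "B": 4.0, "C": 1.0},
    "u2": {"A": 4.0, "B": 5.0},
    "u3": {"A": 1.0, "C": 5.0},
    "u4": {"B": 4.0, "C": 2.0},
}

def item_vectors(ratings):
    """Transpose user -> item ratings into item -> {user: rating} columns."""
    cols = defaultdict(dict)
    for user, items in ratings.items():
        for item, r in items.items():
            cols[item][user] = r
    return dict(cols)

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    shared = set(a) & set(b)
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def predict(user, target, ratings, cols, k=2):
    """Similarity-weighted average of the user's ratings on the k items
    most similar to the target item."""
    sims = [(cosine(cols[target], cols[j]), r)
            for j, r in ratings[user].items() if j != target]
    top = sorted(sims, reverse=True)[:k]
    denom = sum(abs(s) for s, _ in top)
    return sum(s * r for s, r in top) / denom if denom else None

cols = item_vectors(ratings)
pred = predict("u2", "C", ratings, cols)  # u2 has not rated item C
```

In a production setting the pairwise `cosine` calls would be replaced by a similar-items table computed offline, which is the property that lets item-item kNN scale better than the user-user variant.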
The canonical matrix factorization model represents each user and item as low-dimensional vectors in R^k and predicts the rating as the global mean plus user and item biases plus the dot product of the user and item vectors. Because most of the user-item matrix is missing, classical SVD does not apply directly. The parameters are fit by SGD or Alternating Least Squares (ALS) over only the observed entries, with an L2 penalty on the embedding magnitudes. The variant Simon Funk published during the Netflix Prize, often called Funk SVD, popularized SGD over latent factors. ALS alternates between fixing the user matrix and solving a regularized least-squares problem for the item matrix and vice versa; each subproblem decomposes per row, which makes ALS embarrassingly parallel. ALS is especially well suited to implicit-feedback data via the formulation by Hu, Koren, and Volinsky (2008).
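As an illustration of the biased model and its SGD fit, the sketch below trains on a handful of made-up (user, item, rating) triples. It assumes the prediction rule mean + user bias + item bias + dot product stated above with an L2 penalty; the hyperparameters (`k`, `lr`, `reg`, `epochs`) and the names `train_mf` and `rmse` are illustrative choices, not values from Funk's or Koren's published work.

```python
import random

def train_mf(ratings, n_users, n_items, k=8, lr=0.02, reg=0.02, epochs=200, seed=0):
    """Fit r_hat = mu + b_u + b_i + p_u . q_i by SGD over observed entries only."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean
    bu = [0.0] * n_users
    bi = [0.0] * n_items
    P = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    data = list(ratings)
    for _ in range(epochs):
        rng.shuffle(data)
        for u, i, r in data:
            pred = mu + bu[u] + bi[i] + sum(pf * qf for pf, qf in zip(P[u], Q[i]))
            err = r - pred
            # Gradient steps on squared error with an L2 penalty.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)

    def predict(u, i):
        return mu + bu[u] + bi[i] + sum(pf * qf for pf, qf in zip(P[u], Q[i]))

    return predict

def rmse(predict, ratings):
    """Root mean squared error of a predictor over (user, item, rating) triples."""
    return (sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings)) ** 0.5

# Toy data: (user index, item index, rating). Values are made up.
toy = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
model = train_mf(toy, n_users=3, n_items=3)
baseline = sum(r for _, _, r in toy) / len(toy)  # always predict the global mean
```

Comparing `rmse(model, toy)` with `rmse(lambda u, i: baseline, toy)` shows how much the biases and factors improve on the global mean for the training entries; a real evaluation would use held-out ratings instead.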
Factorization machines (FM, Rendle 2010) generalize matrix factorization to a model that takes any real-valued feature vector and predicts a sum of a bias term, linear terms, and pairwise interactions parameterized by low-rank factor vectors. Field-aware Factorization Machines (FFM, Juan et al. 2016) extended this by giving each feature a different factor vector for each feature "field" it interacts with; FFMs were used by winning teams of several Criteo and Avazu CTR-prediction competitions.
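The second-order FM score and the algebraic rearrangement that makes it linear in the number of features can be checked against each other directly. The sketch below assumes dense Python lists for the feature vector `x`, linear weights `w`, and factor matrix `V` (one length-k row per feature); the function names and toy numbers are hypothetical.

```python
def fm_predict(x, w0, w, V):
    """FM score using the O(k * n) form of the pairwise term:
    0.5 * sum_f [ (sum_i V[i][f] x[i])^2 - sum_i (V[i][f] x[i])^2 ]."""
    linear = sum(wi * xi for wi, xi in zip(w, x))
    k = len(V[0])
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise

def fm_predict_naive(x, w0, w, V):
    """Same score computed with an explicit O(k * n^2) loop over feature pairs."""
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise

# Toy example: three features, two latent factors. Values are made up.
x = [1.0, 1.0, 0.0]
w0, w = 0.5, [0.1, 0.2, 0.3]
V = [[1.0, 0.0], [0.5, 0.5], [2.0, 2.0]]
score = fm_predict(x, w0, w, V)
```

Here only features 0 and 1 are active, so the score is the bias 0.5, plus linear terms 0.3, plus the single interaction 0.5, for a total of 1.3 under either function.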
Wide & Deep (Cheng et al. 2016) trains a wide linear model on hand-crafted feature crosses jointly with a deep network on dense embeddings of sparse features, capturing both memorization and generalization. Deep & Cross Network (DCN, Wang et al. 2017) replaces the manual crosses with a stack of "cross layers" that compute high-order feature interactions algorithmically; DCN-V2 (WWW 2021) replaces the cross weight vector with a matrix. Meta's DLRM (Naumov et al. 2019) standardized a similar pattern for ad ranking and is frequently used as a benchmark for recommendation hardware.
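A single cross layer is small enough to write out by hand. The sketch below assumes the commonly cited forms x_{l+1} = x_0 (x_l · w) + b + x_l for the original DCN and x_{l+1} = x_0 ⊙ (W x_l + b) + x_l for DCN-V2, using plain Python lists; the function names are illustrative and the weights are toy values rather than trained parameters.

```python
def cross_layer_v1(x0, xl, w, b):
    """DCN cross layer with a weight vector: x0 * (xl . w) + b + xl."""
    scalar = sum(a * c for a, c in zip(xl, w))
    return [x0_i * scalar + b_i + xl_i for x0_i, b_i, xl_i in zip(x0, b, xl)]

def cross_layer_v2(x0, xl, W, b):
    """DCN-V2 cross layer with a weight matrix: x0 * (W @ xl + b) + xl,
    where the product with x0 is elementwise."""
    proj = [sum(w_ij * xl_j for w_ij, xl_j in zip(row, xl)) + b_i
            for row, b_i in zip(W, b)]
    return [x0_i * p_i + xl_i for x0_i, p_i, xl_i in zip(x0, proj, xl)]

# Toy input with two features; the first layer receives xl = x0.
x0 = [1.0, 2.0]
out_v1 = cross_layer_v1(x0, x0, w=[1.0, 1.0], b=[0.0, 0.0])
out_v2 = cross_layer_v2(x0, x0, W=[[1.0, 0.0], [0.0, 1.0]], b=[0.0, 0.0])
```

Each output coordinate contains products of two input features, and stacking further layers raises the polynomial degree by one per layer, which is how the network builds high-order crosses without hand-written feature engineering.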
A two-tower model encodes the user (and context) with one neural network and the item with another, both producing vectors in the same embedding space. Training uses sampled softmax with in-batch negatives; serving uses approximate nearest-neighbour search (FAISS, ScaNN, HNSW, Vespa) over the precomputed item tower. Two-tower models are the dominant candidate generator at YouTube, TikTok, Spotify, Pinterest, and many other platforms. The YouTube team formalized the in-batch sampling-bias correction needed at scale in their 2019 RecSys paper.
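The in-batch negative objective can be sketched without any deep learning framework. The example below assumes the tower outputs are already computed and passed in as lists of vectors, treats row i's item as the positive for user i, and optionally subtracts a per-item log sampling probability from each logit as one common form of the sampling-bias correction; `in_batch_softmax_loss` and `log_q` are hypothetical names.

```python
import math

def in_batch_softmax_loss(user_vecs, item_vecs, log_q=None):
    """Mean cross-entropy where user i's positive is item i and the other
    items in the batch act as negatives.

    log_q, if given, holds an estimated log in-batch sampling probability per
    item; subtracting it reduces the penalty on popular items that show up as
    negatives more often than a uniform sampler would produce.
    """
    total = 0.0
    for i, u in enumerate(user_vecs):
        logits = [sum(a * b for a, b in zip(u, v)) for v in item_vecs]
        if log_q is not None:
            logits = [s - lq for s, lq in zip(logits, log_q)]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in logits))
        total += log_z - logits[i]
    return total / len(user_vecs)

# Toy batch of two (user, item) pairs with 2-dimensional embeddings.
items = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_softmax_loss([[3.0, 0.0], [0.0, 3.0]], items)
uninformed = in_batch_softmax_loss([[0.0, 0.0], [0.0, 0.0]], items)
```

With all-zero user vectors every item is equally likely, so the loss equals log 2 for a batch of two; user vectors aligned with their own items drive the loss toward zero.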
Neural Collaborative Filtering (He et al., WWW 2017) replaces the dot product of MF with a multi-layer perceptron; later work (Rendle et al. 2020) showed that a well-tuned dot-product MF can match or exceed NCF on standard benchmarks. SASRec (ICDM 2018) applies a causal self-attention transformer to a user's interaction sequence to predict the next item; BERT4Rec (CIKM 2019) uses a bidirectional transformer with a masked-item objective. Multi-gate Mixture-of-Experts (MMoE, KDD 2018) splits a network into expert subnetworks with per-task gates, supporting the multi-objective rankers used at YouTube, TikTok, and other large platforms.
Most large-scale recommender systems use a multi-stage funnel rather than a single model. The funnel exists because the catalog is too large for any single deep network to score in real time, and because different stages benefit from different models and constraints.
| Stage | Items considered | Latency budget | Typical models | Purpose |
|---|---|---|---|---|
| Candidate generation (retrieval) | 1e6 to 1e9 | a few ms | Two-tower DNNs, ANN over embeddings, item-kNN, popularity, recent | Reduce catalog to a few hundred or thousand candidates |
| Heavy ranking | 1e2 to 1e4 | tens of ms | DLRM, DCN-V2, transformer rankers, MMoE multi-task heads | Score each candidate with rich features |
| Re-ranking and policies | 1e1 to 1e2 | a few ms | Diversity (MMR, DPP), business rules, fairness constraints, freshness boosts, slate optimization | Produce the final ordered slate |
Candidate generation is typically retrieval-oriented and embedding-based. The user tower runs at request time over user features (recent interactions, profile, context); the item tower runs offline over the entire catalog and its outputs are indexed in an ANN library such as FAISS (Meta), ScaNN (Google), HNSW (open source), or Vespa. Multiple retrieval sources are usually unioned (a personalized two-tower retriever, a recent-activity retriever, a trending retriever, an editorial retriever) to ensure freshness and diversity.
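The retrieval step and the union of sources can be illustrated with exact search standing in for an ANN index. This is a sketch under stated assumptions: `top_k_by_dot` does brute-force inner-product search that a library such as FAISS or ScaNN would approximate at scale, `union_candidates` uses a simple first-seen de-duplication policy, and the item IDs and vectors are made up.

```python
import heapq

def top_k_by_dot(user_vec, item_index, k):
    """Exact maximum-inner-product search over precomputed item vectors.
    An ANN library would approximate this result for large catalogs."""
    scored = ((sum(a * b for a, b in zip(user_vec, vec)), item_id)
              for item_id, vec in item_index.items())
    return [item_id for _, item_id in heapq.nlargest(k, scored)]

def union_candidates(*sources):
    """Merge ranked candidate lists, keeping first-seen order and dropping duplicates."""
    seen, merged = set(), []
    for source in sources:
        for item_id in source:
            if item_id not in seen:
                seen.add(item_id)
                merged.append(item_id)
    return merged

# Item tower outputs, computed offline (toy values).
item_index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
personalized = top_k_by_dot([1.0, 0.0], item_index, k=2)
trending = ["t1", "a"]   # hypothetical trending retriever output
editorial = ["e1"]       # hypothetical editorial retriever output
candidates = union_candidates(personalized, trending, editorial)
```

The merged list is what the heavy ranker would score next; real systems often also cap the number of candidates taken from each source.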
Heavy ranking is where most of the model capacity lives. Industrial rankers consume hundreds of features (sparse user IDs, sparse item IDs, dense counters, contextual signals, cross features) and predict a calibrated probability of one or more outcomes. Multi-task heads share a backbone but predict different objectives such as click, watch-time, and like, and the final score is usually a weighted combination tuned via online experimentation.
Re-ranking enforces constraints that are awkward to express inside the ranking model: limit creator repetition, ensure at least one news item per slate, demote borderline content, balance demographic categories, and mix in cold-start exploration. This stage is often the home of explicit business rules, fairness interventions, and bandit-based exploration.
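One of the diversity methods used at this stage, maximal marginal relevance (MMR), can be sketched as a greedy loop. The example assumes relevance scores from the ranker, a caller-supplied similarity function, and a trade-off weight `lam`; the same-creator similarity and the item names are hypothetical stand-ins for a real content similarity.

```python
def mmr_rerank(scores, similarity, k, lam=0.7):
    """Greedy MMR: repeatedly pick the item maximizing
    lam * relevance - (1 - lam) * max similarity to items already selected."""
    selected = []
    remaining = set(scores)
    while remaining and len(selected) < k:
        def mmr(item):
            redundancy = max((similarity(item, s) for s in selected), default=0.0)
            return lam * scores[item] - (1 - lam) * redundancy
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy ranker scores and a creator-based similarity (1.0 if same creator).
scores = {"v1": 0.9, "v2": 0.85, "v3": 0.6}
creator = {"v1": "c1", "v2": "c1", "v3": "c2"}
same_creator = lambda a, b: 1.0 if creator[a] == creator[b] else 0.0

diverse = mmr_rerank(scores, same_creator, k=2, lam=0.5)
by_score = mmr_rerank(scores, same_creator, k=2, lam=1.0)
```

With `lam=0.5` the second slot goes to `v3` from a different creator even though `v2` has the higher ranker score; with `lam=1.0` the procedure reduces to sorting by score.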
Evaluation of recommender systems is famously difficult because the system both shapes user behaviour and is evaluated on data that the system itself produced.
| Metric | Type | What it measures |
|---|---|---|
| RMSE | Offline accuracy | Root mean squared error of predicted ratings |
| MAE | Offline accuracy | Mean absolute error of predicted ratings |
| Precision@k | Offline ranking | Fraction of top-k that are relevant |
| Recall@k | Offline ranking | Fraction of relevant items found in top-k |
| NDCG@k | Offline ranking | Position-discounted gain over the top-k |
| MRR | Offline ranking | Mean reciprocal rank of the first relevant item |
| HitRate@k | Offline ranking | Whether at least one relevant item appears in the top-k |
| Coverage | Offline diversity | Fraction of catalog ever recommended |
| Intra-list diversity | Offline diversity | Average dissimilarity among items in a slate |
| CTR | Online | Click-through rate on shown items |
| Watch time | Online | Total or per-impression watch time |
| Session length | Online | Time or number of items per session |
| Retention / DAU | Online | Fraction of users returning over time |
| Revenue per user | Online | Direct monetary outcome |
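The offline ranking metrics in the table can be computed from a ranked list and a set of relevant items. The sketch below assumes binary relevance and a single user; averaging over users (which turns the reciprocal rank into MRR) is left to the caller, and the function names are illustrative.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)

def hit_rate_at_k(ranked, relevant, k):
    """1.0 if at least one relevant item appears in the top-k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant item; averaged over users this is MRR."""
    for pos, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / pos
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(1.0 / math.log2(pos + 1)
              for pos, item in enumerate(ranked[:k], start=1) if item in relevant)
    ideal = sum(1.0 / math.log2(pos + 1)
                for pos in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal else 0.0

ranked = ["a", "b", "c", "d"]
relevant = {"b", "d"}
```

For this toy list with k = 3, one of the three shown items is relevant and one of the two relevant items is found, so precision is 1/3, recall is 1/2, and the reciprocal rank is 1/2.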
A/B testing is the industry-standard arbiter of production decisions. Even when a candidate model wins on every offline metric, large platforms typically require a controlled online experiment to demonstrate a positive effect on a North Star metric (often watch time, retention, or revenue) before launch. Counterfactual evaluation techniques such as Inverse Propensity Scoring (IPS) attempt to estimate what an alternative policy would have done using logged data; they are useful for screening candidate models cheaply but rarely fully replace live experiments.
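The basic IPS estimator is short enough to show in full. This sketch assumes each log record carries the probability with which the logging policy chose the shown item, and that the candidate policy can report its own probability for the same action; the optional weight clipping, the record layout, and the names `ips_estimate` and `always_a` are illustrative assumptions.

```python
def ips_estimate(logs, target_prob, clip=None):
    """Inverse propensity scoring estimate of a target policy's average reward.

    logs: iterable of (context, action, reward, logging_prob) tuples.
    target_prob(context, action): probability the target policy picks `action`.
    clip: optional cap on importance weights, trading bias for lower variance.
    """
    logs = list(logs)
    total = 0.0
    for context, action, reward, logging_prob in logs:
        weight = target_prob(context, action) / logging_prob
        if clip is not None:
            weight = min(weight, clip)
        total += weight * reward
    return total / len(logs)

# Toy logs from a uniform-random logging policy over items "a" and "b":
# item "a" was always clicked, item "b" never.
logs = [("u", "a", 1.0, 0.5), ("u", "b", 0.0, 0.5),
        ("u", "a", 1.0, 0.5), ("u", "b", 0.0, 0.5)]

# Hypothetical candidate policy that always shows item "a".
always_a = lambda context, action: 1.0 if action == "a" else 0.0

value = ips_estimate(logs, always_a)
clipped = ips_estimate(logs, always_a, clip=1.5)
```

The unclipped estimate recovers the reward the always-"a" policy would have earned on this toy data, while clipping the weights at 1.5 pulls the estimate down, illustrating the bias that clipping introduces in exchange for lower variance.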
| Dataset | Domain | Approximate scale | Notes |
|---|---|---|---|
| MovieLens 100K to 32M | Movies | 1e5 to 3e7 ratings | GroupLens, the canonical academic benchmark |
| Netflix Prize | Movies | ~1e8 ratings, 480k users, 18k movies | Released for the 2006-2009 prize; later withdrawn |
| Amazon Product Reviews | E-commerce | Hundreds of millions of reviews | McAuley lab releases used for sequential recsys |
| Yelp Open Dataset | Business reviews | Tens of millions of reviews | Yelp Dataset Challenge |
| Last.fm 1K / 360K | Music | Listening histories | Long-standing music benchmark |
| Steam reviews | Video games | Millions of reviews | Sequential and content benchmarks |
| Goodreads | Books | Tens of millions of ratings | UCSD release |
| MIND | News | ~160k articles, 1M users | Microsoft News Dataset, ACL 2020 |
| KuaiRec / KuaiRand | Short video | Dense interaction matrices from Kuaishou | Counterfactual evaluation friendly |
The Netflix Prize dataset was withdrawn after a 2007 study by Narayanan and Shmatikov showed that supposedly anonymous user histories could be re-identified via public IMDb reviews, leading to a class action lawsuit and the cancellation of a planned second competition.
| Company / product | Domain | Public details |
|---|---|---|
| Netflix | Movies and TV | Multi-stage personalization across rows, artwork selection, and search |
| YouTube | Videos | Two-stage retrieval and ranking (Covington et al. 2016); two-tower with sampling-bias correction (2019); MMoE multi-task ranking |
| Spotify | Music and podcasts | Discover Weekly, Daily Mixes, Home; hybrid of CF, NLP over playlists, and audio embeddings |
| Amazon | E-commerce | Item-to-item CF (Linden et al. 2003) and decades of successor systems |
| TikTok / Douyin (ByteDance) | Short videos | The For You Page, powered by the Monolith real-time embedding-table system |
| Instagram and Facebook (Meta) | Feeds, Reels, ads | DLRM-family models on custom infrastructure; multi-task multi-objective ranking |
| Pinterest | Pins, boards | PinSage graph CNN for related Pins; two-tower retrieval; Pixie random walk |
| LinkedIn | Jobs, feed, People You May Know | CF and content hybrids; large-scale graph models |
| Twitter / X | Timeline ranking | 2023 "For You" release described candidate sourcing, Heavy Ranker, and heuristics |
| eBay, Walmart, Alibaba | E-commerce | Session-based and sequential rankers; Alibaba's DIN and DIEN models are widely cited |
| Toutiao (ByteDance), Yahoo | News | Real-time CTR-driven feeds; bandit-based exploration over fresh stories |
The cold-start problem refers to the difficulty of producing useful recommendations for users or items with little or no interaction history. The three classic flavours are new-user cold start (a user just signed up), new-item cold start (a freshly uploaded video or product has no plays or purchases), and new-system cold start (an entirely new platform launches with no historical data).
Mitigations include falling back to content-based features (text embeddings, image embeddings, audio fingerprints, item taxonomies), popularity baselines, or demographic and contextual signals. Side-information-aware models such as factorization machines, two-tower models with item-content towers, and LLM-derived item embeddings tend to outperform interaction-only models in cold-start regimes. Active exploration via multi-armed bandits and contextual bandits is widely used in news feeds and ad systems where item lifetimes are short. Transfer learning across markets and verticals is another common approach at large platforms.
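Bandit-based exploration for new items can be illustrated with Thompson sampling over click-through rates. The sketch below is one simple non-contextual variant, assuming Bernoulli click feedback and a uniform Beta(1, 1) prior per item; the class name, the simulated click rates, and the item names are hypothetical.

```python
import random

class BetaBernoulliBandit:
    """Thompson sampling with an independent Beta posterior per item."""

    def __init__(self, arms, seed=0):
        self.rng = random.Random(seed)
        # [alpha, beta] counts per arm, starting from a uniform Beta(1, 1) prior.
        self.stats = {arm: [1.0, 1.0] for arm in arms}

    def choose(self):
        """Sample a plausible CTR for each arm and show the arm with the highest draw."""
        samples = {arm: self.rng.betavariate(a, b) for arm, (a, b) in self.stats.items()}
        return max(samples, key=samples.get)

    def update(self, arm, clicked):
        """Add one success (click) or one failure (no click) to the arm's posterior."""
        self.stats[arm][0 if clicked else 1] += 1.0

def simulate(true_ctr, rounds=2000, seed=0):
    """Run the bandit against simulated users with fixed, made-up click rates."""
    bandit = BetaBernoulliBandit(true_ctr, seed=seed)
    env = random.Random(seed + 1)
    pulls = {arm: 0 for arm in true_ctr}
    for _ in range(rounds):
        arm = bandit.choose()
        pulls[arm] += 1
        bandit.update(arm, env.random() < true_ctr[arm])
    return pulls

pulls = simulate({"fresh_item": 0.5, "stale_item": 0.05})
```

Because a new item starts with a wide posterior, it receives impressions early on; as evidence accumulates, traffic concentrates on whichever item has the higher observed click rate.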
The field has wrestled with a recurring set of difficulties.
Several trends have reshaped the field since 2023. Sequential transformers in the SASRec and BERT4Rec families have become the default backbone for modeling user histories. Generative retrieval approaches such as TIGER cast next-item prediction as autoregressive generation of discrete semantic IDs. Large language models serve as zero-shot or fine-tuned rankers, explanation generators, and item encoders over textual catalog descriptions. Hyperscaler ad and feed teams have pushed embedding-table-dominated DLRM-style models past one trillion parameters. Multimodal item encoders combining text, image, audio, and video features are increasingly common, and RLHF-style preference modeling has been adapted to ranker training. Privacy pressure (App Tracking Transparency, third-party cookie deprecation, regional regulation) has accelerated work on on-device retrieval, federated learning, and differential privacy.
| Venue | Type | Focus |
|---|---|---|
| RecSys (ACM Conference on Recommender Systems, founded 2007) | Conference | Dedicated venue for recommender systems research |
| KDD (ACM SIGKDD) | Conference | Data mining; many production recommender papers |
| WSDM (Web Search and Data Mining) | Conference | Search and recommendation |
| SIGIR (ACM Conference on R&D in Information Retrieval) | Conference | Information retrieval; classic ranking papers |
| WWW (The Web Conference) | Conference | Web-scale systems including recommendation |
| NeurIPS, ICML, ICLR | Conferences | General ML, with growing recsys tracks |
| CIKM (Conference on Information and Knowledge Management) | Conference | IR, KM, and recsys |
| ACM TOIS, ACM TORS | Journals | Long-form recsys research |
The first RecSys conference was held on 19-20 October 2007 in Minneapolis, organized in part by the GroupLens group at the University of Minnesota; the conference is now the field's de facto annual gathering.