Recommender System

Updated Jun 20, 2026

A recommender system (also called a recommendation system or recommendation engine) is an information filtering system that predicts the preference a user would give to an item and uses those predictions to suggest items the user is likely to want, typically without the user issuing any explicit query. Recommender systems are the invisible backbone of much of the modern internet: they decide which movies appear on Netflix's home row, which products Amazon places under "customers also bought," which videos autoplay next on YouTube, which songs land in a Spotify Discover Weekly playlist, which short clips appear in a TikTok For You feed, and which people LinkedIn suggests you might know. Their commercial weight is large: Netflix engineers reported in 2015 that roughly 2 of every 3 hours streamed on the service are discovered through recommendations, and that personalization and recommendation together save the company more than 1 billion dollars per year by reducing subscriber churn.^[31] Without recommendation, large catalogs would be unusable; with it, the same catalog feels personally curated for every visitor.

This article surveys the field from its origins in the early 1990s through the matrix factorization era of the Netflix Prize, the deep learning revolution of the late 2010s, and the ongoing transition toward generative recommenders built on transformer foundation models. It covers the major algorithmic families, the dominant industrial systems, evaluation methodology, and the open research challenges around cold start, fairness, filter bubbles, and exploration.

What is a recommender system?

In the most general formulation, a recommender takes a set of users U, a set of items I, and a (typically sparse) set of observed interactions or ratings, and produces for any given user u a ranked list of items from I that the user has not yet seen. The interactions can be explicit (a five-star rating, a thumbs up, a like) or implicit (a click, a watch, a purchase, the time spent on a page). Implicit feedback dominates most modern systems because it is abundant, cheap to collect, and closer to actual user behavior, but it is also noisier and carries only positive signal: the absence of a click does not necessarily mean the user would dislike a video, only that they did not see it or did not act on it.^[9]

The recommendation task is closely related to but distinct from search. In search the user provides an explicit query and the system retrieves matching items, while in recommendation the system must infer what the user wants without an explicit query. Recommenders are also distinct from pure ranking systems used in advertising, although the underlying machine learning techniques heavily overlap.

A modern industrial recommender is rarely a single model. It is usually a multi-stage pipeline that first retrieves a few hundred or few thousand candidate items from a catalog of millions or billions, then ranks those candidates with a more expensive model, and finally re-ranks them to satisfy diversity, fairness, and business constraints before showing the result to the user.^[22]

Why do recommender systems matter?

Recommendation is one of the highest-leverage applications of machine learning in industry because it directly governs what billions of people see and buy. A widely cited 2013 McKinsey analysis estimated that 35 percent of what consumers purchase on Amazon and 75 percent of what they watch on Netflix come from algorithmic product recommendations.^[32] Netflix's own engineers gave a more conservative internal figure: in their 2015 description of the production system they wrote that the homepage "is the main presentation of recommendations, where 2 of every 3 hours streamed on Netflix are discovered."^[31]

The scale and the stakes are easy to underestimate. As of 2015 Netflix already had "more than 65 million members who stream more than 100 million hours of movies and TV shows per day," served through a homepage of roughly 40 rows with up to 75 videos per row.^[31] The design pressure is acute because, as the same paper observed, "a typical Netflix member loses interest after perhaps 60 to 90 seconds of choosing, having reviewed 10 to 20 titles (perhaps 3 in detail) on one or two screens."^[31] A good recommender has only seconds and a handful of slots to find something compelling.

The payoff is measured in retention and revenue, not just clicks. Netflix reported that personalization raises the "effective catalog size" (a measure of how broadly viewing is spread across the catalog) to roughly 4 times that of a popularity-only baseline, and concluded that "the combined effect of personalization and recommendations save us more than 1 billion dollars per year" through lower subscription cancellations.^[31] These figures explain why every large consumer platform treats its recommender stack as core infrastructure rather than a feature.

History

When did recommender systems begin? (1992 to 1997)

The phrase "collaborative filtering" was coined in 1992 by David Goldberg and colleagues at Xerox PARC for a system called Tapestry, which let employees of a small workgroup attach annotations to email messages and then filter their own inboxes based on the annotations of trusted colleagues.^[3] Tapestry was small, manual, and required users to know each other, but it established the principle that one person's reactions could be useful signals for another.

The first system that combined collaborative filtering with the open scale of the internet was GroupLens, presented at CSCW 1994 by Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl.^[1] GroupLens recommended Usenet netnews articles. News reader clients displayed predicted scores for each article and let users rate articles after reading them; rating servers called "Better Bit Bureaus" gathered the ratings and predicted scores using the heuristic that people who agreed in the past would probably agree again.^[1] Users could rate under pseudonyms to protect privacy, and the architecture was deliberately open so that alternative clients and rating servers could interoperate. GroupLens is widely regarded as the first true recommender system in the modern sense, and its descendants include the long-running MovieLens research dataset and the GroupLens research group at the University of Minnesota.

In 1995 two further systems were presented at the CHI conference. Will Hill, Larry Stead, Mark Rosenstein, and George Furnas at Bellcore described a video recommender, and Upendra Shardanand and Pattie Maes at the MIT Media Lab described Ringo (later commercialized as Firefly), a social information filtering system that recommended music albums and artists. The 1997 special issue of Communications of the ACM edited by Paul Resnick and Hal Varian, titled simply "Recommender Systems," introduced the term to a broad computer science audience and is generally cited as the moment the field acquired its name.^[2] Joseph Konstan, John Riedl, and others contributed founding papers throughout the late 1990s that established the algorithmic vocabulary still in use today.

Amazon item-to-item collaborative filtering: 1998

In 1998 Amazon filed a patent for "Collaborative recommendations using item-to-item similarity mappings," and the technique had already been running in production on Amazon.com for several years by the time Greg Linden, Brent Smith, and Jeremy York described it publicly in a 2003 IEEE Internet Computing paper.^[4] The paper has been hugely influential: in 2017 the journal recognized it as the paper from its history that had best withstood the test of time.^[28]

The key insight was that user-to-user collaborative filtering, which finds the most similar users to a target user and recommends what they liked, scales poorly: with millions of customers and millions of items, online lookup is too expensive. Amazon's item-to-item algorithm flipped the problem. It precomputed a similar-items table offline by counting which items were frequently purchased together, normalized for item popularity. At serve time, for each item in a user's history the system fetched the precomputed list of similar items and combined them. The online cost depended only on the size of the user's history, not on the size of the user base, which made the technique tractable at Amazon's scale and gave the world the iconic "Customers who bought this also bought" widget.^[4]
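The offline/online split described above can be sketched in a few lines. This is a minimal illustration, not Amazon's production algorithm: it assumes cosine similarity over co-purchase counts as the popularity normalization, and the names `build_similar_items`, `recommend`, and the toy `baskets` data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

def build_similar_items(baskets):
    """Offline step: cosine similarity between items over purchase histories.

    baskets maps a customer id to the set of item ids that customer bought.
    """
    item_count = defaultdict(int)  # customers who bought each item
    pair_count = defaultdict(int)  # customers who bought both items of a pair
    for items in baskets.values():
        for item in items:
            item_count[item] += 1
        for a, b in combinations(sorted(items), 2):
            pair_count[(a, b)] += 1

    similar = defaultdict(dict)
    for (a, b), together in pair_count.items():
        # Dividing by sqrt(count_a * count_b) normalizes for item popularity.
        score = together / sqrt(item_count[a] * item_count[b])
        similar[a][b] = score
        similar[b][a] = score
    return similar

def recommend(history, similar, k=3):
    """Online step: the cost depends only on the size of the user's history."""
    scores = defaultdict(float)
    for item in history:
        for other, score in similar.get(item, {}).items():
            if other not in history:
                scores[other] += score
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]

# Toy purchase data (illustrative only).
baskets = {
    "u1": {"book", "lamp", "pen"},
    "u2": {"book", "pen"},
    "u3": {"lamp", "desk"},
    "u4": {"book", "pen", "desk"},
}
similar_items = build_similar_items(baskets)
print(recommend({"book"}, similar_items))
```

In a real deployment the similar-items table would be rebuilt in a batch job and served from a key-value store; the sketch keeps both steps in memory for clarity.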

What was the Netflix Prize? (2006 to 2009)

In October 2006 Netflix announced a public competition with a one million dollar grand prize for any team that could improve the accuracy of its in-house Cinematch rating predictor by ten percent on a held-out test set, measured by root mean squared error. Netflix released a training set of about one hundred million movie ratings from roughly half a million anonymized users on about seventeen thousand movies. The prize ran for almost three years.^[6]

A pseudonymous blog post in December 2006 by Simon Funk (Brandyn Webb) titled "Netflix Update: Try This at Home" laid out a streamlined matrix factorization technique that became known as Funk SVD.^[5] Despite the name, it was not a true singular value decomposition. Funk approximated the user-item rating matrix as the product of a tall user matrix and a wide item matrix of low rank, and learned the latent factors by stochastic gradient descent only on the observed entries, with regularization to prevent overfitting on the sparse signal.^[5] The post was informal and conversational, and the method placed Funk third on the leaderboard at the time. It also redirected the entire field toward latent-factor models: almost every subsequent strong solution was built on top of Funk's basic recipe.^[6]
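A minimal sketch of this recipe follows: regularized stochastic gradient descent over observed ratings only. It is a simplification rather than Funk's exact procedure (this version updates all factors at once), and the hyperparameters, the function names, and the toy ratings are assumptions chosen for illustration.

```python
import random
from math import sqrt

def predict(P, Q, u, i):
    """Predicted rating: inner product of user and item latent vectors."""
    return sum(pu * qi for pu, qi in zip(P[u], Q[i]))

def rmse(ratings, P, Q):
    return sqrt(sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings) / len(ratings))

def init_factors(n_users, n_items, rank, seed=0):
    """Small positive random initialization (an assumption of this sketch)."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.0, 0.1) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(0.0, 0.1) for _ in range(rank)] for _ in range(n_items)]
    return P, Q

def train(ratings, P, Q, lr=0.05, reg=0.02, epochs=500):
    """SGD on observed (user, item, rating) triples with L2 regularization."""
    rank = len(P[0])
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for f in range(rank):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)

# Toy data: users 0-1 prefer items 0-1, users 2-3 prefer items 2-3.
ratings = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 5), (1, 3, 2),
    (2, 2, 5), (2, 3, 4), (2, 0, 1),
    (3, 2, 4), (3, 3, 5), (3, 1, 2),
]
P, Q = init_factors(n_users=4, n_items=4, rank=2)
before = rmse(ratings, P, Q)
train(ratings, P, Q)
after = rmse(ratings, P, Q)
print(f"training RMSE before={before:.3f} after={after:.3f}")
```

Unobserved cells never enter the loss, which is what distinguishes this approach from a classical SVD of a filled-in matrix.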

The winning team, BellKor's Pragmatic Chaos, was a 2009 merger of three previously competing teams. It included Yehuda Koren (then at Yahoo Research, formerly AT&T), Robert Bell and Chris Volinsky from AT&T Labs, Andreas Töscher and Michael Jahrer from Commendo Research in Austria (originally team BigChaos), and Martin Piotte and Martin Chabbert from Pragmatic Theory in Quebec.^[7]^[8] They achieved a test RMSE of 0.8567, a 10.06 percent improvement over Cinematch, and the one million dollar grand prize was awarded in New York City on September 21, 2009.^[33] Their final submission blended more than one hundred individual models using gradient boosted decision trees as the meta learner.^[7] The dominant components were enriched matrix factorization variants (timeSVD++, which modeled how user tastes drift over time, and asymmetric SVD, which used the set of items a user had rated as features), restricted Boltzmann machines (RBMs) for collaborative filtering, and item-item neighborhood models that captured residuals the latent-factor models missed.^[6]^[7] Team "The Ensemble" matched that accuracy, but its final submission arrived about twenty minutes later, so the earlier entry won.

The prize is often credited with three lasting effects. First, it normalized matrix factorization as the default collaborative filtering technique for the next decade.^[6] Second, it popularized large-scale ensembling and stacking. Third, somewhat paradoxically, it convinced Netflix that explicit five-star rating prediction was less valuable than predicting watch behavior, and the Cinematch successor that eventually launched used implicit signals rather than the prize-winning RMSE objective. Netflix's own engineers later confirmed this pivot, writing that "the days when stars and DVDs were the focus of recommendations at Netflix have long passed."^[31] Netflix never ran a second prize, in part because the released anonymized data was famously re-identified by Arvind Narayanan and Vitaly Shmatikov in 2008, raising serious privacy concerns.^[25]

Deep learning era: 2016 to 2020

By the mid 2010s deep learning had transformed computer vision and natural language processing, and recommender systems followed. The transition was anchored by a series of influential papers from major industry labs.

Wide & Deep Learning for Recommender Systems, by Heng-Tze Cheng and colleagues at Google in 2016, introduced a hybrid architecture that combined a wide linear model (good at memorizing specific feature crosses such as "installed app A and impression for app B") with a deep neural network tower (good at generalizing through learned dense embeddings of sparse categorical features).^[11] The two halves were trained jointly with a single logistic loss. Wide & Deep was deployed in the Google Play app store, which served over a billion active users and a million apps, and online A/B tests showed a 3.9 percent lift in app acquisitions over the wide-only control and about 1 percent over a deep-only model.^[11] The paper's open-source TensorFlow implementation made the architecture a default starting point across the industry.

DeepFM, by Huifeng Guo and colleagues at Huawei Noah's Ark Lab in 2017, addressed a perceived weakness of Wide & Deep, namely that the wide side still required hand-engineered cross features.^[13] DeepFM replaced the wide linear model with a Factorization Machine that learned all order-2 feature interactions automatically, sharing the same embedding lookup with the deep tower.^[13] This eliminated the need for feature crossing by hand and made the model easier to deploy in new domains.

Neural Collaborative Filtering (NCF), by Xiangnan He and colleagues at the National University of Singapore in 2017 (WWW conference), reframed classical matrix factorization as a special case of a neural network.^[14] NCF replaced the inner product between user and item embeddings with a multi-layer perceptron that could learn arbitrary interaction functions. The paper proposed three concrete instantiations: Generalized Matrix Factorization (GMF), Multi-Layer Perceptron (MLP), and Neural Matrix Factorization (NeuMF) which fused the two.^[14] NCF became the canonical introduction to deep collaborative filtering for thousands of researchers, although later replication studies (Rendle and colleagues at Google in 2020) showed that a well-tuned classical matrix factorization could often match or beat NCF on the same benchmarks, sparking a productive debate about evaluation rigor in the field.^[26]

Deep Interest Network (DIN), by Guorui Zhou and colleagues at Alibaba in 2018 (KDD), targeted click-through rate prediction in display advertising.^[15] DIN observed that users have many diverse interests reflected in long behavior histories, but most prior models compressed those histories into a fixed-length embedding and lost information. DIN introduced a local activation unit, essentially an attention mechanism, that re-weighted the user's historical items conditioned on the candidate ad being scored.^[15] Users interested in both books and electronics would have their book history activated when scoring a book ad and their electronics history activated when scoring an electronics ad. DIN was deployed across Alibaba's main display advertising system on training data with over two billion samples and lifted CTR by up to 10 percent and revenue per mille by 3.8 percent.^[15] It was a clear early demonstration that attention belonged in recommender systems.
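The candidate-conditioned pooling idea can be illustrated with a toy example. This is a simplified sketch, not the paper's architecture: DIN learns its activation unit as a small feed-forward network, whereas the stand-in below uses a sigmoid of a scaled dot product. The function names, the scale factor, and the two-dimensional embeddings (axis 0 for books, axis 1 for electronics) are all hypothetical.

```python
from math import exp

def activation_weight(history_vec, candidate_vec, scale=4.0):
    """Stand-in for a learned local activation unit (an assumption)."""
    dot = sum(h * c for h, c in zip(history_vec, candidate_vec))
    return 1.0 / (1.0 + exp(-scale * dot))

def pooled_interest(history, candidate_vec):
    """Weighted sum of history embeddings, re-weighted per candidate.

    The weights are deliberately not normalized to sum to one, so a user
    with many matching history items yields a stronger pooled signal.
    """
    pooled = [0.0] * len(candidate_vec)
    for vec in history:
        w = activation_weight(vec, candidate_vec)
        for d, value in enumerate(vec):
            pooled[d] += w * value
    return pooled

# Toy history: two book interactions and one electronics interaction.
history = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
book_ad = [1.0, 0.0]
electronics_ad = [0.0, 1.0]

for_book = pooled_interest(history, book_ad)
for_electronics = pooled_interest(history, electronics_ad)
print("pooled for book ad:       ", [round(x, 3) for x in for_book])
print("pooled for electronics ad:", [round(x, 3) for x in for_electronics])
```

The same history produces a different user representation for each candidate, which is the property a fixed-length pooled embedding lacks.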

Sampling-bias-corrected neural modeling for large corpus item recommendations, by Xinyang Yi and colleagues at Google in 2019 (RecSys), formalized the two-tower architecture for large-scale retrieval.^[16] One neural network tower encodes the user (or query) and a second tower encodes the item. The two towers produce dense embeddings in a shared vector space, and similarity is computed by inner product. This separation allows item embeddings to be precomputed and stored in an approximate nearest-neighbor (ANN) index, reducing serving cost from millions of forward passes to a single user-tower forward pass plus a sublinear ANN lookup. The 2019 paper's key technical contribution was a streaming algorithm for estimating item frequency on the fly so that in-batch negative sampling could be debiased against power-law popularity, which is essential when the training distribution skews heavily toward head items.^[16] The system was deployed for YouTube candidate retrieval over a corpus of tens of millions of videos and shipped measurable engagement improvements in live A/B tests.^[16] The two-tower model is now the workhorse retrieval architecture across the industry.
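The debiasing step can be sketched as an in-batch softmax loss in which each logit is reduced by the log of the item's estimated sampling probability. The sketch below assumes those probabilities are already available (it does not implement the streaming frequency estimator), uses plain lists in place of tower outputs, and all names are hypothetical.

```python
from math import exp, log

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def corrected_in_batch_loss(user_vecs, item_vecs, item_probs):
    """Mean softmax cross-entropy with in-batch negatives.

    Row i's positive is item i; every other item in the batch acts as a
    negative. item_probs[j] is the estimated probability that item j is
    sampled into a batch; subtracting log(item_probs[j]) from the logit
    offsets the over-penalization of popular items as negatives.
    """
    total = 0.0
    for i, user in enumerate(user_vecs):
        logits = [dot(user, item) - log(p) for item, p in zip(item_vecs, item_probs)]
        peak = max(logits)  # subtract the max for numerical stability
        log_partition = peak + log(sum(exp(x - peak) for x in logits))
        total += log_partition - logits[i]
    return total / len(user_vecs)

# Toy batch of two (user, clicked item) pairs with 2-d embeddings.
user_vecs = [[1.0, 0.0], [0.0, 1.0]]
item_vecs = [[1.0, 0.0], [0.0, 1.0]]

uniform = corrected_in_batch_loss(user_vecs, item_vecs, [0.5, 0.5])
skewed = corrected_in_batch_loss(user_vecs, item_vecs, [0.9, 0.1])
print(f"uniform sampling: {uniform:.4f}  popularity-skewed sampling: {skewed:.4f}")
```

With uniform sampling probabilities the correction shifts every logit equally and leaves the loss unchanged; it only matters when some items appear in batches far more often than others.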

Sequential recommendation, the task of predicting the next item from a user's recent action history, was transformed by attention. SASRec (Self-Attentive Sequential Recommendation) by Wang-Cheng Kang and Julian McAuley at UC San Diego (ICDM 2018) used a unidirectional transformer decoder, similar to GPT in its causal masking, to model user sequences.^[17] SASRec outperformed prior RNN and CNN-based sequential recommenders and was an order of magnitude faster to train.^[17] BERT4Rec by Fei Sun and colleagues at Alibaba (CIKM 2019) replaced the unidirectional decoder with a bidirectional transformer encoder trained with a masked-item Cloze objective borrowed from BERT, gaining accuracy on standard benchmarks.^[18] Together SASRec and BERT4Rec established the transformer as a first-class architecture for recommendation.

Industrial scale and the modern stack: 2018 onward

In parallel with the academic deep learning wave, several papers from large platforms documented the engineering systems behind real recommenders.

Pinterest published PinSAGE by Rex Ying and colleagues at Stanford and Pinterest (KDD 2018), the first graph neural network deployed at web scale.^[19] PinSAGE ran graph convolutions over a graph of three billion pins and boards with eighteen billion edges, sampling neighborhoods via random walks rather than enumerating them.^[19] The deployed system improved offline metrics, user studies, and online engagement, and it remained a standard reference architecture for graph-based item embedding for years.

Facebook (now Meta) open-sourced DLRM (Deep Learning Recommendation Model) by Maxim Naumov and colleagues in 2019.^[20] DLRM was a clean reference design that made explicit how an industrial CTR model handles the mix of dense numeric features and high-cardinality categorical features. Each categorical feature is mapped to an embedding table; dense features are processed through a bottom MLP; the embeddings and dense representation are interacted via pairwise dot products in the spirit of a factorization machine; and a top MLP produces the final prediction.^[20] DLRM also showed how to parallelize the embedding tables across many GPUs (model parallelism) while keeping the MLPs in data-parallel mode.^[20] The largest production variants at Meta have hundreds of billions of parameters dominated by enormous embedding tables.
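The data flow just described (embedding lookup, bottom MLP, pairwise dot-product interaction, top MLP) can be traced in a small pure-Python sketch. It is illustrative only: each MLP is reduced to a single layer, the weights are made-up constants rather than trained values, and the function and table names are hypothetical rather than taken from the open-source DLRM code.

```python
from math import exp

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dense_layer(x, weights, bias, relu=True):
    """One fully connected layer; weights holds one row per output unit."""
    out = [dot(row, x) + b for row, b in zip(weights, bias)]
    return [max(0.0, v) for v in out] if relu else out

def interact(dense_vec, embeddings):
    """Concatenate the dense vector with all pairwise dot products."""
    vectors = [dense_vec] + embeddings
    pairs = [dot(vectors[i], vectors[j])
             for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    return dense_vec + pairs

def dlrm_forward(dense_features, categorical_ids, tables, bottom, top):
    dense_vec = dense_layer(dense_features, *bottom)            # bottom MLP
    embeddings = [tables[name][idx] for name, idx in categorical_ids.items()]
    features = interact(dense_vec, embeddings)                  # interaction
    logit = dense_layer(features, *top, relu=False)[0]          # top MLP
    return 1.0 / (1.0 + exp(-logit))                            # click probability

# Toy parameters: embedding dimension 2, three dense input features.
tables = {"user": {7: [0.5, -0.2]}, "ad": {3: [0.1, 0.4]}}
bottom = ([[0.3, 0.1, 0.0], [-0.2, 0.4, 0.5]], [0.0, 0.1])
top = ([[0.2, -0.1, 1.0, 0.5, 0.3]], [0.0])

p_click = dlrm_forward([1.0, 2.0, 0.5], {"user": 7, "ad": 3}, tables, bottom, top)
print(f"predicted click probability: {p_click:.3f}")
```

The bottom layer must output vectors of the same dimension as the embeddings so that the dot products are defined, which is why the toy bottom layer maps three dense inputs to two outputs.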

ByteDance described Monolith in 2022, the recommendation system behind TikTok and Douyin.^[21] Monolith introduced a collisionless embedding table that uses cuckoo hashing to give every user and item ID a unique embedding rather than sharing slots through hashing, paired with frequency filtering and embedding expiration to control memory growth.^[21] The system was designed for online learning so that fresh user signals could update the model on a 30-minute or even shorter cadence, which suits TikTok's tight feedback loop far better than periodic batch retraining.^[21] Monolith was a major piece of public evidence that real-time training had become a practical requirement for video feed recommenders.
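The interplay of unique rows, frequency filtering, and expiration can be shown with a toy table. This is only a sketch of the bookkeeping: a Python dict stands in for the cuckoo hash map, embeddings start at zero, and the class name, the `min_count` admission threshold, and the `ttl` expiration window are hypothetical parameters, not values from the Monolith paper.

```python
class CollisionlessTable:
    """Toy embedding table with one private row per admitted ID."""

    def __init__(self, dim, min_count=2, ttl=100):
        self.dim = dim
        self.min_count = min_count  # occurrences required before admission
        self.ttl = ttl              # steps of inactivity before expiration
        self.rows = {}              # id -> embedding; no slot is ever shared
        self.counts = {}            # id -> occurrences seen before admission
        self.last_seen = {}         # id -> step of the most recent lookup

    def lookup(self, key, step):
        """Return the row for key, or None while the ID is still filtered."""
        if key not in self.rows:
            self.counts[key] = self.counts.get(key, 0) + 1
            if self.counts[key] < self.min_count:
                return None         # too rare so far: no memory allocated
            del self.counts[key]
            self.rows[key] = [0.0] * self.dim
        self.last_seen[key] = step
        return self.rows[key]

    def expire(self, step):
        """Drop rows not looked up within the last ttl steps."""
        stale = [k for k, seen in self.last_seen.items() if step - seen > self.ttl]
        for key in stale:
            del self.rows[key]
            del self.last_seen[key]
        return len(stale)

table = CollisionlessTable(dim=2, min_count=2, ttl=10)
print(table.lookup("user:1", step=0))  # None: first sighting is filtered
print(table.lookup("user:1", step=1))  # admitted on the second sighting
print(table.expire(step=20))           # number of rows dropped as stale
```

The sketch never clears the pre-admission counters, which a long-running system would also need to bound.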

Netflix has continued to publish on its multi-task ranker, which jointly predicts watch, completion, save, and other engagement signals; on its causal and counterfactual evaluation methods; and most recently on its transition toward a transformer-based foundation model that ingests all user interactions as tokens. The production system as of 2015 was already "not one algorithm, but rather a collection of different algorithms serving different use cases that come together to create the complete Netflix experience," including the personalized video ranker, a Top N ranker, a Trending Now ranker, a Continue Watching ranker, and a video-video similarity ("sims") algorithm, all assembled by a page-generation algorithm.^[31] Netflix's 2025 PRS workshop materials describe replacing more than thirty specialized models with a unified architecture, with reported double-digit improvements in completion rates for niche titles.^[29]

LinkedIn's People You May Know (PYMK) is a representative people-recommendation system. It uses a multi-stage pipeline with three families of candidate generators (graph-based connections-of-connections, similarity-based on profile attributes, and heuristic-based on shared signals such as workplace or school), an L0 ranker that selects a few thousand from billions, a heavier deep ranker, and finally re-rankers for fairness, diversity, and avoidance of overrepresented power users. PYMK processes hundreds of terabytes of data daily.

Spotify combines collaborative filtering at the playlist level (two songs are similar if users place them on the same playlist), convolutional audio-feature analysis on raw spectrograms, and natural language processing over lyrics, blog posts, and user-generated playlist names. Discover Weekly, launched in 2015, is the most public face of this stack and now reaches tens of millions of users every Monday.

The most recent direction is generative recommendation, in which the recommender is a transformer-style sequence model that directly emits the next item ID rather than scoring a candidate set.^[30] Meta's HSTU (Hierarchical Sequential Transduction Units) and Netflix's foundation model are early production examples.^[29]^[30] The shift mirrors the broader transition in machine learning from many specialized supervised models to a few large foundation models, and it brings recommendation closer to the techniques used in retrieval augmented generation and large language model post-training.

What are the main types of recommender systems?

Recommender techniques are traditionally grouped into four broad families. Most production systems are hybrids that mix more than one.^[22]

Family	Core idea	Typical inputs	Strengths	Weaknesses
Collaborative filtering	Learn from co-occurrence of user-item interactions	User-item interaction matrix	Strong when interaction data is abundant; learns latent taste structure	Cold start; struggles with new users or new items
Content-based	Recommend items similar to those the user liked, by item features	Item attributes, user profile	Works for new items; explainable	Tends to recommend more of the same; needs good features
Knowledge-based	Match items to user requirements via explicit rules or constraints	Item attributes, user-stated needs	Works without history; handles complex requirements	Requires curated knowledge; less personalized over time
Hybrid	Combine two or more of the above	All of the above	Best of each family; mitigates cold start	More complex to build and tune

What is collaborative filtering?

Collaborative filtering (CF) is the most widely studied family. It assumes that users who agreed in the past will agree in the future and that items that have been liked by similar users are similar to each other.

User-based CF computes similarity between users (often by Pearson correlation or cosine similarity over their rating vectors) and recommends items that the most similar users liked. It was the original approach in GroupLens and Ringo.^[1] It scales poorly when the user base is large and the rating matrix is sparse, because finding the k nearest neighbors requires comparing against everyone.
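A compact sketch of user-based prediction follows, using Pearson correlation over co-rated items and a mean-centered weighted average of neighbor ratings. The neighborhood size, the choice to ignore negatively correlated users, and the toy `ratings` data are assumptions for illustration.

```python
from math import sqrt

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def pearson(a, b):
    """Pearson correlation over the items both users rated."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    mean_a = mean(a[i] for i in common)
    mean_b = mean(b[i] for i in common)
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    var_a = sum((a[i] - mean_a) ** 2 for i in common)
    var_b = sum((b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)

def predict(ratings, user, item, k=2):
    """Predict a rating from the k most similar users who rated the item."""
    target = ratings[user]
    neighbors = []
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        sim = pearson(target, theirs)
        if sim > 0:  # this sketch ignores negatively correlated users
            neighbors.append((sim, other))
    top = sorted(neighbors, reverse=True)[:k]
    base = mean(target.values())
    if not top:
        return base
    # Each neighbor contributes its deviation from its own average rating.
    weighted = sum(sim * (ratings[o][item] - mean(ratings[o].values())) for sim, o in top)
    return base + weighted / sum(sim for sim, _ in top)

ratings = {
    "ann": {"a": 5, "b": 4, "c": 1},
    "bob": {"a": 5, "b": 5, "c": 1, "d": 5},
    "cat": {"a": 1, "b": 2, "c": 5, "d": 1},
}
print(round(predict(ratings, "ann", "d"), 3))
```

The loop over every other user in `predict` is the scaling bottleneck: its cost grows with the size of the user base rather than with the size of one user's history.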

Item-based CF, the Amazon variant, computes similarity between items based on which users co-interacted with them. The similar-items table can be precomputed offline, and online recommendation is a fast lookup over the user's recent items.^[4] Item-based CF dominated the early 2000s.

Matrix factorization generalizes both by learning low-dimensional latent vectors for users and items such that the inner product approximates the observed ratings.^[6] Funk SVD was the breakthrough form.^[5] Variants such as biased matrix factorization, SVD++, timeSVD++, asymmetric SVD, weighted matrix factorization for implicit feedback (the Hu, Koren, and Volinsky 2008 paper), and Bayesian Personalized Ranking (BPR) by Steffen Rendle and colleagues in 2009 dominated the post-Netflix-Prize era and remain strong baselines.^[9]^[10]

Neural collaborative filtering replaces the inner product with a learned function (an MLP, an attention layer, a transformer block) so that more expressive interaction patterns can be captured.^[14]

Content-based filtering

Content-based methods build profiles of items from their features (movie genres, actors, plot summary text; song acoustic features and lyrics; product category and description) and profiles of users from items they have engaged with. Recommendation reduces to matching the user profile to candidate items, often by cosine similarity in the feature space or by a learned classifier. Content-based methods naturally handle the new-item cold start because they do not need any interaction history for a new item, and they tend to be more transparent because the recommended item shares features the user has demonstrably liked.
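As a minimal sketch of profile matching, the example below averages the feature vectors of liked items into a user profile and ranks unseen items by cosine similarity. The three-dimensional feature space (action, comedy, romance), the item names, and the function names are made up for illustration.

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def build_profile(liked, item_features):
    """User profile: the mean feature vector of the items the user liked."""
    vectors = [item_features[item] for item in liked]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def recommend(liked, item_features, k=2):
    profile = build_profile(liked, item_features)
    candidates = [item for item in item_features if item not in liked]
    ranked = sorted(candidates, key=lambda item: cosine(profile, item_features[item]), reverse=True)
    return ranked[:k]

# Feature order: [action, comedy, romance]. "new_action" has no interactions.
item_features = {
    "heist": [1.0, 0.0, 0.0],
    "spy": [1.0, 0.2, 0.0],
    "romcom": [0.0, 1.0, 1.0],
    "new_action": [0.9, 0.1, 0.0],
}
print(recommend(["heist", "spy"], item_features))
```

The brand-new item ranks first purely on its features, which is the new-item cold start behavior described above.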

Knowledge-based and constraint-based

Knowledge-based recommenders are common in domains where users buy items rarely (cars, houses, financial products) and where explicit requirements matter more than long-run preference signals. The system encodes domain knowledge (a sedan with at least four seats, a mortgage rate below five percent, a vacation in Europe under a budget) and either retrieves matching items (case-based) or solves a constraint satisfaction problem (constraint-based).^[22] These systems are often interactive, refining recommendations through dialog.
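In its simplest form, the constraint-based variant reduces to filtering a catalog by user-stated requirements. The sketch below assumes hard constraints expressed as one predicate per attribute; the catalog, attribute names, and thresholds are invented for illustration, and a real system would typically also relax constraints when nothing matches.

```python
def matches(item, constraints):
    """True when the item satisfies every stated requirement."""
    return all(check(item[attribute]) for attribute, check in constraints.items())

def recommend(catalog, constraints, sort_key):
    """Return the names of matching items, best first by sort_key."""
    hits = [item for item in catalog if matches(item, constraints)]
    return [item["name"] for item in sorted(hits, key=sort_key)]

catalog = [
    {"name": "compact", "body": "hatchback", "seats": 4, "price": 18000},
    {"name": "family sedan", "body": "sedan", "seats": 5, "price": 27000},
    {"name": "roadster", "body": "convertible", "seats": 2, "price": 35000},
    {"name": "budget sedan", "body": "sedan", "seats": 5, "price": 21000},
]

# Requirements the user stated in a dialog: a sedan, at least four seats, a price cap.
constraints = {
    "body": lambda value: value == "sedan",
    "seats": lambda value: value >= 4,
    "price": lambda value: value <= 25000,
}
print(recommend(catalog, constraints, sort_key=lambda item: item["price"]))
```

No interaction history is involved at any point, which is why this family works for rare purchases.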

Hybrid systems

A hybrid recommender combines two or more techniques to overcome the weaknesses of any single approach. Robin Burke's 2002 taxonomy distinguished seven hybridization strategies: weighted, switching, mixed, feature combination, cascade, feature augmentation, and meta-level.^[23] Most modern industrial systems are hybrids almost by default: a two-tower model uses both collaborative signals (interaction history) and content features (text embeddings, item categories), a sequential transformer mixes interaction order with item content, and a multi-stage pipeline routes through several different model families before serving.

Industrial systems

The table below summarizes some of the most influential or publicly documented industrial recommenders. Each has shaped a generation of follow-on work.

System	Company	Year	Domain	Key technique	Notable contribution
GroupLens	University of Minnesota	1994	Usenet news	User-based collaborative filtering	First public collaborative filtering recommender
Ringo / Firefly	MIT	1995	Music	User-based CF with social graph	Early consumer-facing recommender
Item-to-item CF	Amazon	1998	E-commerce	Item-item similarity	Made CF tractable at web scale
Cinematch	Netflix	2000	Movies	Pearson similarity, later matrix factorization	Catalyst for the Netflix Prize
Funk SVD	Simon Funk	2006	Movie ratings	Latent-factor matrix factorization via SGD	Defined the matrix-factorization era
BellKor Pragmatic Chaos	Joint team	2009	Movies	Ensemble of MF, RBM, kNN	Won the Netflix Prize
YouTube DNN	YouTube / Google	2016	Video	Two-stage candidate gen + ranking DNN	Showed deep learning at YouTube scale
Wide & Deep	Google Play	2016	App store	Hybrid wide linear + deep tower	Industry standard CTR architecture
DeepFM	Huawei	2017	Ads	FM + deep tower with shared embeddings	Removed manual feature crossing
Neural CF	NUS	2017	Implicit feedback	MLP over user and item embeddings	Popularized neural recommendation
DIN	Alibaba	2018	Display ads	Attention over user behavior history	First major attention recommender
PinSAGE	Pinterest	2018	Image discovery	Sampled graph convolutions	First web-scale graph recommender
SASRec	UCSD	2018	Sequential	Causal transformer	Made transformers default for sequences
BERT4Rec	Alibaba	2019	Sequential	Bidirectional transformer with Cloze	Bidirectional sequential recommender
Two-Tower retrieval	YouTube / Google	2019	Video retrieval	Dual-encoder with sampling-bias correction	Standard retrieval architecture
DLRM	Meta	2019	Ads, feed	Embedding tables + interaction + MLP	Open reference for industrial CTR
Monolith	TikTok / ByteDance	2022	Short video	Collisionless embeddings + online learning	Real-time training at scale
Foundation Model	Netflix	2024 to 2025	Video	Generative transformer over user tokens	Unified replacement for many specialized models

How does a multi-stage recommender pipeline work?

A single neural network cannot score every item in a catalog of hundreds of millions for every visit; it would be too slow and too expensive. Instead, modern recommenders use a multi-stage funnel.

Retrieval (or candidate generation) narrows the catalog from millions or billions to a few hundred or a few thousand candidates per request. Common retrieval methods include two-tower deep learning models with approximate nearest neighbor indexes, item-to-item co-visitation lookups, graph-based personalized PageRank, popularity baselines, and rule-based candidate sources for fresh content.^[12]^[16] A production system typically merges results from several retrievers running in parallel.

Ranking scores the retrieved candidates with a heavy model (often a DLRM-style architecture or a transformer-based interaction model) that uses many more features per item than retrieval can afford.^[20] Ranking objectives are usually multi-task, predicting click probability, watch time, completion, save, share, and other signals jointly.

Re-ranking applies business and policy logic on top of the ranker's scores. Examples include diversification across categories or creators, fairness constraints (boosting underrepresented sellers or content creators), freshness boosts for new content, deduplication of similar items, exploration injections to gather data on uncertain items, and slate optimization where the value of an item depends on the other items shown next to it.

The split between retrieval, ranking, and re-ranking is the central organizing principle of nearly every large-scale recommender stack from YouTube to TikTok to LinkedIn.^[12]
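The three stages can be sketched end to end with toy components. Everything here is a placeholder: the two retrieval sources return fixed lists, the ranker reads scores from a dictionary rather than running a model, and the re-ranker applies a single per-category cap as its only diversity rule.

```python
def retrieve(user, sources, limit):
    """Stage 1: merge candidates from several retrievers, dropping duplicates."""
    seen, merged = set(), []
    for source in sources:
        for item in source(user)[:limit]:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged

def rank(user, candidates, score):
    """Stage 2: order candidates by a (normally expensive) scoring model."""
    return sorted(candidates, key=lambda item: score(user, item), reverse=True)

def rerank(ranked, category_of, max_per_category, k):
    """Stage 3: enforce a simple diversity rule on the final slate."""
    counts, slate = {}, []
    for item in ranked:
        category = category_of[item]
        if counts.get(category, 0) < max_per_category:
            slate.append(item)
            counts[category] = counts.get(category, 0) + 1
        if len(slate) == k:
            break
    return slate

# Toy stand-ins for real retrievers, a real ranker, and item metadata.
CATEGORY = {"v1": "music", "v2": "music", "v3": "music", "v4": "news", "v5": "sport"}
SCORES = {"v1": 0.9, "v2": 0.8, "v3": 0.7, "v4": 0.6, "v5": 0.5}

def covisitation(user):
    return ["v1", "v2", "v3"]

def popular(user):
    return ["v2", "v4", "v5"]

def model_score(user, item):
    return SCORES[item]

candidates = retrieve("u42", [covisitation, popular], limit=3)
ranked = rank("u42", candidates, model_score)
slate = rerank(ranked, CATEGORY, max_per_category=2, k=3)
print(slate)
```

The third music video is skipped in favor of a lower-scored news item, showing how re-ranking can override the ranker's raw order.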

How are recommender systems evaluated?

Recommender systems can be evaluated offline on historical interaction logs or online in live traffic. Both are essential and they often disagree.

Offline metrics

Offline evaluation typically holds out the last interaction or the last day of interactions for each user and asks the model to predict it. Metrics fall into rating-prediction metrics and ranking metrics.

Metric	Definition	Captures
RMSE / MAE	Root mean squared error or mean absolute error of predicted vs actual ratings	Rating prediction accuracy; was the Netflix Prize objective
Precision@K	Fraction of the top K recommendations that are relevant	Top-list quality, position-agnostic
Recall@K	Fraction of all relevant items that appear in the top K	Coverage of relevant items
Hit Rate@K	Whether at least one relevant item appears in the top K	Binary success per user
MRR	Mean reciprocal rank of the first relevant item	Quick discovery of one good item
MAP@K	Mean average precision at K, averaged over users	Multiple relevant items, position-aware
NDCG@K	Normalized discounted cumulative gain at K, with logarithmic position discount	Ranking quality with graded relevance
AUC	Area under the ROC curve for predicting interaction vs non-interaction	Pairwise ranking accuracy
Coverage	Fraction of catalog ever recommended	Long-tail exposure
Diversity	Pairwise dissimilarity within recommendation lists	Variety
Novelty	Inverse popularity of recommended items	Surprise, escape from popularity bias
Serendipity	Recommendations that are both unexpected and useful	Pleasant surprise

NDCG and MAP are the most widely reported ranking metrics in the academic literature. They are rank-aware, meaning a relevant item near the top of the list contributes more than the same item near the bottom, which mirrors how users actually scan a result page.
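A minimal sketch of four of the metrics in the table, for a single user, makes the rank-aware behavior concrete. The function names and the toy lists are illustrative; the NDCG variant shown uses the linear-gain form with a base-2 logarithmic discount, which is one of several conventions in use.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top k recommendations that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant item, or 0 if none is found."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def ndcg_at_k(ranked, gains, k):
    """NDCG@K with graded relevance; gains maps item -> relevance (missing = 0)."""
    dcg = sum(
        gains.get(item, 0) / math.log2(pos + 1)
        for pos, item in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 1) for pos, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["a", "b", "c", "d"]          # model output, best first
relevant = {"b", "d", "e"}             # held-out items the user engaged with
gains = {item: 1 for item in relevant}  # binary relevance for this example

print(precision_at_k(ranked, relevant, 3))
print(recall_at_k(ranked, relevant, 3))
print(reciprocal_rank(ranked, relevant))
print(ndcg_at_k(ranked, gains, 3))
```

Averaging `reciprocal_rank` over users gives MRR. Moving item "b" from position 2 to position 1 leaves Precision@3 unchanged but raises NDCG@3, which is the position sensitivity described above.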

Online evaluation

No offline metric perfectly correlates with the business outcome a recommender is meant to drive (revenue, engagement, retention, satisfaction), so industrial teams rely heavily on A/B testing: users are split randomly between a control and one or more treatment models, and the difference in the chosen primary metrics is measured over a fixed window. Netflix engineers note that retention is the ultimate target but is hard to move, observing that "our retention rates are already high enough that it takes a very meaningful improvement to make a retention difference of even 0.1% (10 basis points)," so teams optimize toward medium-term engagement as a measurable proxy.^[31] Sophisticated teams layer A/B tests so that many independent experiments can run concurrently, use interleaving (mixing the rankings of two systems within a single result page) to gain statistical power, maintain holdout populations to measure long-term effects, and use causal inference and counterfactual estimators (such as inverse propensity weighting) when randomization is impossible.
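The inverse propensity weighting estimator mentioned above can be sketched as follows. It assumes the logging system recorded the probability with which each shown item was chosen; the log layout, the policy functions, and the optional weight clipping are assumptions for this example rather than a description of any particular platform's tooling.

```python
def ips_estimate(logs, target_policy, clip=None):
    """Estimate the average reward a target policy would have earned.

    logs: iterable of (context, action, reward, logging_propensity), where
          logging_propensity is the probability the logging policy gave to
          the action that was actually shown (must be > 0).
    target_policy: function (context, action) -> probability under the new policy.
    clip: optional cap on importance weights, trading bias for lower variance.
    """
    total, n = 0.0, 0
    for context, action, reward, propensity in logs:
        weight = target_policy(context, action) / propensity
        if clip is not None:
            weight = min(weight, clip)
        total += weight * reward
        n += 1
    return total / n


# Hypothetical logs from a policy that showed "x" or "y" uniformly at random.
logs = [
    ("u", "x", 1, 0.5),
    ("u", "x", 0, 0.5),
    ("u", "y", 1, 0.5),
    ("u", "y", 1, 0.5),
]


def always_y(context, action):
    return 1.0 if action == "y" else 0.0


def uniform(context, action):
    return 0.5


print(ips_estimate(logs, always_y))  # estimated reward if only "y" were shown
print(ips_estimate(logs, uniform))   # recovers the logged average reward
```

The estimate is only unbiased when the logging policy gave nonzero probability to every action the target policy might take, which is one reason exploration traffic is valuable for evaluation as well as for learning.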

Offline-online disagreement is a chronic problem. A model with higher offline NDCG often loses in A/B tests because it overfits to logged behavior, ignores presentation effects, encourages clickbait, or fails on fresh content the offline data cannot capture. Bridging this gap is an open research area.

Challenges

What is the cold start problem?

The cold start problem has three flavors. New-user cold start: a fresh visitor has no history, so collaborative filtering cannot find similar users. New-item cold start: a freshly uploaded item has no interactions, so collaborative filtering cannot place it. System cold start: a new product has neither users nor items at scale. Common mitigations include onboarding questionnaires that collect explicit preferences, demographic and contextual side information, content-based features that work on day one, hybrid models that fall back to content when collaborative signal is weak, and active learning that asks new users to rate a few carefully selected diagnostic items.^[22]

Filter bubbles and feedback loops

In his 2011 book The Filter Bubble, Eli Pariser warned that personalized algorithms were quietly enclosing each user in a unique informational bubble, hiding viewpoints they would not naturally encounter.^[24] Recommender systems amplify this through feedback loops: the model is trained on data the model itself helped generate, which over time can narrow the catalog of items shown, reinforce existing preferences, and reduce a user's exposure to diverse content. The empirical magnitude of filter-bubble effects is contested, with some studies finding modest effects and others finding strong ones depending on platform and definition. Mitigation techniques include explicit diversification in re-ranking, exploration policies that occasionally surface novel items, calibrated recommendations that match the user's expressed preference distribution rather than amplifying their dominant interest, and editorial overrides for important content categories.
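One way to quantify how far a recommendation list drifts from a user's expressed preference distribution is a Kullback-Leibler divergence between the genre mix of the user's history and the genre mix of the list. The sketch below is a simplified illustration: the genre labels, item IDs, and the smoothing constant `alpha` (used to avoid division by zero when a genre is absent from the list) are assumptions.

```python
import math
from collections import Counter


def genre_distribution(items, genre_of):
    """Empirical distribution over genres for a list of items."""
    counts = Counter(genre_of[i] for i in items)
    total = sum(counts.values())
    return {genre: c / total for genre, c in counts.items()}


def calibration_kl(history_dist, rec_dist, alpha=0.01):
    """KL(history || smoothed recommendations); 0 means perfectly calibrated."""
    kl = 0.0
    for genre, p in history_dist.items():
        # Mix in a little of the history distribution so q is never zero.
        q = (1 - alpha) * rec_dist.get(genre, 0.0) + alpha * p
        kl += p * math.log(p / q)
    return kl


# Hypothetical catalog: ten dramas and six comedies.
genre_of = {f"d{i}": "drama" for i in range(10)}
genre_of.update({f"c{i}": "comedy" for i in range(6)})

history = [f"d{i}" for i in range(7)] + ["c0", "c1", "c2"]     # 70% drama, 30% comedy
skewed_list = [f"d{i}" for i in range(10)]                     # 100% drama
balanced_list = [f"d{i}" for i in range(7)] + ["c3", "c4", "c5"]

p = genre_distribution(history, genre_of)
kl_skewed = calibration_kl(p, genre_distribution(skewed_list, genre_of))
kl_balanced = calibration_kl(p, genre_distribution(balanced_list, genre_of))
print(kl_skewed, kl_balanced)
```

A re-ranker can subtract a multiple of this divergence from the list's total relevance score, which discourages a feedback loop from collapsing the list onto the user's single dominant genre.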

Fairness, transparency, and ethics

Recommenders allocate attention, which has economic and social consequences. A music recommender may give established artists a structural advantage over new ones, a job-recommendation system may inadvertently steer protected groups away from certain roles, and a news feed may differentially expose political content. Modern fairness research distinguishes between fairness to users (consistent quality across groups) and fairness to providers (equitable exposure across creators or sellers). Re-ranking constraints, calibration losses, and exposure regularizers are all used to enforce fairness goals. The 2024 RecSys best paper, "You Don't Bring Me Flowers: Mitigating Unwanted Recommendations Through Conformal Risk Control," focuses on the related problem of avoiding actively unwanted recommendations.^[27]

Exploration and exploitation

A pure exploitation policy that always shows the items the model is most confident about will never gather data on items it is uncertain about, which compounds bias and starves new content. Recommenders therefore borrow heavily from the multi-armed bandit literature, especially contextual bandits that condition the action on user features. Common policies include epsilon-greedy, Thompson sampling, upper confidence bound, and learned exploration policies that propose novel items where the long-term value of the gathered information is high. Exploration is essential for catalog freshness, fairness to new creators, and robustness to distribution shift.
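Thompson sampling, one of the policies listed above, can be sketched for the simplest non-contextual case where each item has an unknown click-through rate. The class name, the Beta(1, 1) prior, and the simulated click rates are assumptions for illustration; a contextual bandit would additionally condition the posterior on user features.

```python
import random


class BetaThompson:
    """Thompson sampling over Bernoulli arms with Beta posteriors."""

    def __init__(self, n_arms, rng):
        self.successes = [1] * n_arms  # Beta(1, 1) uniform prior per arm
        self.failures = [1] * n_arms
        self.rng = rng

    def select(self):
        # Draw one plausible click rate per arm and show the arm with the best draw.
        samples = [
            self.rng.betavariate(a, b)
            for a, b in zip(self.successes, self.failures)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, clicked):
        if clicked:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1


def simulate(true_ctr, rounds, seed=0):
    """Run the policy against simulated users and count impressions per arm."""
    rng = random.Random(seed)
    policy = BetaThompson(len(true_ctr), rng)
    pulls = [0] * len(true_ctr)
    for _ in range(rounds):
        arm = policy.select()
        clicked = rng.random() < true_ctr[arm]
        policy.update(arm, clicked)
        pulls[arm] += 1
    return pulls


# Hypothetical click-through rates for three items.
pulls = simulate([0.05, 0.10, 0.30], rounds=2000)
print(pulls)
```

Uncertain arms have wide posteriors and therefore occasionally produce the highest sample, so they keep receiving some impressions early on; as evidence accumulates, traffic concentrates on the arm with the highest true rate.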

Latency, scale, and freshness

Industrial recommenders must respond within tens of milliseconds, measured at tail latency rather than on average, while scoring against catalogs of millions or billions of items, and they must integrate fresh content within minutes. This requires approximate nearest neighbor search (libraries such as FAISS and ScaNN, and graph indexes such as HNSW), sharded embedding tables, model-parallel training, online learning pipelines such as TikTok's Monolith, and edge caching of pre-scored slates.^[21] Engineering quality matters as much as model quality.

Privacy

Recommenders need behavioral data, which is among the most sensitive data a platform collects. The 2008 re-identification of the Netflix Prize dataset by Narayanan and Shmatikov demonstrated that even "anonymized" rating data can leak identity when joined to public sources.^[25] Modern privacy techniques applied to recommenders include differential privacy added to gradients during training, federated learning that keeps raw data on user devices and shares only model updates, on-device personalization for sensitive domains, and user controls that let people inspect and edit the inferred profile that drives their recommendations.
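The core step of differentially private gradient training, clipping each example's gradient and adding Gaussian noise before averaging, can be sketched as below. The function names and parameters are assumptions, and the sketch deliberately omits privacy accounting: it does not compute the resulting privacy budget, which a real deployment would need.

```python
import math
import random


def clip_l2(vector, max_norm):
    """Scale a vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vector))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vector]


def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip per-example gradients, add Gaussian noise to the sum, then average.

    Clipping bounds how much any single user's data can move the update;
    the noise standard deviation is noise_multiplier * max_norm.
    """
    clipped = [clip_l2(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[d] for g in clipped) for d in range(dim)]
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [x / len(clipped) for x in noisy]


# Hypothetical per-example gradients for a two-parameter model.
grads = [[3.0, 4.0], [0.3, 0.4]]
rng = random.Random(0)

exact = private_mean_gradient(grads, max_norm=1.0, noise_multiplier=0.0, rng=rng)
noisy = private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.0, rng=rng)
print(exact, noisy)
```

With the noise multiplier set to zero the function reduces to averaging clipped gradients, which makes the effect of clipping easy to inspect: the first gradient, with norm 5, is scaled down to norm 1, while the second is left unchanged.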

Adversarial robustness

Because recommendation slots have economic value, providers and adversaries try to manipulate them through fake accounts, sybil attacks, coordinated rating campaigns, prompt injection in conversational recommenders, and SEO-style content tuning. Detection and robust learning under attack are permanent research and engineering problems.

Open research directions

The field is currently moving in several directions simultaneously.

Generative recommendation. Treating recommendation as autoregressive sequence generation (next-item-ID prediction by a transformer trained on user history) aims to collapse the retrieval, ranking, and re-ranking stages into a single foundation model.^[30] Meta's HSTU and Netflix's foundation model are early production examples.^[29]^[30] A central open question is how to tokenize items so that the vocabulary is both compact and semantically meaningful.

LLM-augmented recommenders. Large language models can generate item descriptions, score candidates with rich reasoning, power conversational recommendation interfaces, and ground on collaborative signals through retrieval. Netflix's CRAG (Collaborative Retrieval Augmented Generation) is one published example.

Causal recommendation. Logged behavior is confounded by what was previously shown to the user. Causal estimators try to isolate the effect of recommending an item from confounding factors such as its position and the other items shown next to it, which improves both offline evaluation and policy learning.

Multi-stakeholder objectives. The field is increasingly framing recommendation as a multi-objective problem that explicitly weighs user satisfaction, creator livelihood, platform revenue, and societal externalities, rather than optimizing only short-term engagement.

Reinforcement learning for long-term value. Predicting the next click is myopic. RL-based recommenders optimize cumulative reward across a session or a subscription lifetime, with policies that may sacrifice an immediate click for a higher-value future engagement.

References

Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., and Riedl, J. (1994). "GroupLens: an open architecture for collaborative filtering of netnews." *Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work*, pp. 175 to 186. ↩
Resnick, P., and Varian, H. R. (1997). "Recommender systems." *Communications of the ACM*, 40(3), pp. 56 to 58. ↩
Goldberg, D., Nichols, D., Oki, B. M., and Terry, D. (1992). "Using collaborative filtering to weave an information tapestry." *Communications of the ACM*, 35(12), pp. 61 to 70. ↩
Linden, G., Smith, B., and York, J. (2003). "Amazon.com recommendations: item-to-item collaborative filtering." *IEEE Internet Computing*, 7(1), pp. 76 to 80. ↩
Funk, S. (2006). "Netflix Update: Try This at Home." Blog post, sifter.org/~simon/journal/20061211.html. ↩
Koren, Y., Bell, R., and Volinsky, C. (2009). "Matrix factorization techniques for recommender systems." *IEEE Computer*, 42(8), pp. 30 to 37. ↩
Bell, R. M., Koren, Y., and Volinsky, C. (2009). "The BellKor solution to the Netflix Grand Prize." Netflix Prize documentation. ↩
Toscher, A., Jahrer, M., and Bell, R. M. (2009). "The BigChaos solution to the Netflix Grand Prize." Netflix Prize documentation. ↩
Hu, Y., Koren, Y., and Volinsky, C. (2008). "Collaborative filtering for implicit feedback datasets." *Proceedings of the 8th IEEE International Conference on Data Mining*, pp. 263 to 272. ↩
Rendle, S., Freudenthaler, C., Gantner, Z., and Schmidt-Thieme, L. (2009). "BPR: Bayesian Personalized Ranking from implicit feedback." *Proceedings of UAI 2009*. ↩
Cheng, H.-T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., Anderson, G., Corrado, G., Chai, W., Ispir, M., et al. (2016). "Wide & Deep Learning for Recommender Systems." *Proceedings of the 1st Workshop on Deep Learning for Recommender Systems*, pp. 7 to 10. ↩
Covington, P., Adams, J., and Sargin, E. (2016). "Deep Neural Networks for YouTube Recommendations." *Proceedings of the 10th ACM Conference on Recommender Systems (RecSys)*, pp. 191 to 198. ↩
Guo, H., Tang, R., Ye, Y., Li, Z., and He, X. (2017). "DeepFM: A Factorization-Machine based Neural Network for CTR Prediction." *Proceedings of IJCAI 2017*, pp. 1725 to 1731. ↩
He, X., Liao, L., Zhang, H., Nie, L., Hu, X., and Chua, T.-S. (2017). "Neural Collaborative Filtering." *Proceedings of the 26th International Conference on World Wide Web (WWW)*, pp. 173 to 182. ↩
Zhou, G., Zhu, X., Song, C., Fan, Y., Zhu, H., Ma, X., Yan, Y., Jin, J., Li, H., and Gai, K. (2018). "Deep Interest Network for Click-Through Rate Prediction." *Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD)*, pp. 1059 to 1068. ↩
Yi, X., Yang, J., Hong, L., Cheng, D. Z., Heldt, L., Kumthekar, A., Zhao, Z., Wei, L., and Chi, E. (2019). "Sampling-bias-corrected neural modeling for large corpus item recommendations." *Proceedings of the 13th ACM Conference on Recommender Systems (RecSys)*, pp. 269 to 277. ↩
Kang, W.-C., and McAuley, J. (2018). "Self-Attentive Sequential Recommendation." *Proceedings of IEEE International Conference on Data Mining (ICDM)*, pp. 197 to 206. ↩
Sun, F., Liu, J., Wu, J., Pei, C., Lin, X., Ou, W., and Jiang, P. (2019). "BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer." *Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM)*, pp. 1441 to 1450. ↩
Ying, R., He, R., Chen, K., Eksombatchai, P., Hamilton, W. L., and Leskovec, J. (2018). "Graph Convolutional Neural Networks for Web-Scale Recommender Systems." *Proceedings of KDD 2018*, pp. 974 to 983. ↩
Naumov, M., Mudigere, D., Shi, H.-J. M., Huang, J., Sundaraman, N., Park, J., et al. (2019). "Deep Learning Recommendation Model for Personalization and Recommendation Systems." arXiv:1906.00091. ↩
Liu, Z., Zou, L., Zou, X., Wang, C., Zhang, B., Tang, D., Zhu, B., Zhu, Y., Wu, P., Wang, K., and Cheng, Y. (2022). "Monolith: Real Time Recommendation System With Collisionless Embedding Table." arXiv:2209.07663. ↩
Ricci, F., Rokach, L., and Shapira, B. (eds.) (2022). *Recommender Systems Handbook*, 3rd edition. Springer, New York. ↩
Burke, R. (2002). "Hybrid recommender systems: Survey and experiments." *User Modeling and User-Adapted Interaction*, 12(4), pp. 331 to 370. ↩
Pariser, E. (2011). *The Filter Bubble: What the Internet Is Hiding from You*. Penguin Press. ↩
Narayanan, A., and Shmatikov, V. (2008). "Robust de-anonymization of large sparse datasets." *Proceedings of the 2008 IEEE Symposium on Security and Privacy*, pp. 111 to 125. ↩
Rendle, S., Krichene, W., Zhang, L., and Anderson, J. (2020). "Neural Collaborative Filtering vs. Matrix Factorization Revisited." *Proceedings of the 14th ACM Conference on Recommender Systems (RecSys)*, pp. 240 to 248. ↩
De Toni, G., Purificato, E., Gomez, E., Passerini, A., Lepri, B., and Consonni, C. (2024). "You Don't Bring Me Flowers: Mitigating Unwanted Recommendations Through Conformal Risk Control." *Proceedings of the 18th ACM Conference on Recommender Systems (RecSys 2024)*. Best Paper Award. ↩
Smith, B., and Linden, G. (2017). "Two decades of recommender systems at Amazon.com." *IEEE Internet Computing*, 21(3), pp. 12 to 18. ↩
Netflix Research (2025). "Foundation Model for Large-Scale Personalized Recommendation." Netflix Personalization, Recommendations and Search Workshop. ↩
Zhai, J., Liao, L., Liu, X., Wang, Y., Li, R., Cao, X., Gao, L., Gong, Z., Gu, F., He, J., Liu, Y., and Lu, J. (2024). "Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations." *Proceedings of ICML 2024*. ↩
Gomez-Uribe, C. A., and Hunt, N. (2015). "The Netflix Recommender System: Algorithms, Business Value, and Innovation." *ACM Transactions on Management Information Systems (TMIS)*, 6(4), Article 13, pp. 1 to 19. ↩
McKinsey & Company (2013). "How retailers can keep up with consumers." McKinsey insights, October 2013. ↩
National Institute of Statistical Sciences (2009). "Bell and Volinsky Part of AT&T Team that Won $1 Million from Netflix Prize Contest." NISS news, September 2009. ↩

