Collaborative filtering (CF) is a family of techniques used in recommender systems to predict a user's preferences by collecting and analyzing preference signals from many other users. The core assumption is the wisdom of the crowd: if two users have agreed on the quality of certain items in the past, they are likely to agree on other items in the future. Collaborative filtering ignores the actual content of items and works only from observed user-item interactions, which means a single CF algorithm can power recommendations for movies, songs, products, news articles, or research papers without any domain-specific feature engineering.
Collaborative filtering is one of the most widely deployed approaches in machine learning for personalization. It powers feeds and homepages at companies including Netflix, Amazon, Spotify, and YouTube, and it remains a foundational building block in modern industrial recommendation pipelines, even after the rise of large neural and graph-based models.
Imagine you and your friend both love the same cartoons. One day your friend watches a new cartoon and says it is great. Even though you have never seen it, you would probably like it too, because you and your friend have the same taste. Collaborative filtering works the same way. It looks at what lots of people watch or buy, finds the people who like the same things you like, and then suggests things those people enjoyed that you have not tried yet.
A collaborative filter takes as input a sparse user-item interaction matrix R of size m by n, where m is the number of users and n is the number of items. Each entry R[u, i] holds either an explicit rating (for example a 1 to 5 star score), an implicit signal (a click, watch, purchase, or play count), or is missing because user u has not interacted with item i. In a typical real-world catalog more than 99% of entries are missing.
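As a concrete illustration, here is a minimal sketch of how such a matrix is typically stored, using SciPy's compressed sparse row format on a tiny invented interaction log (the triplets below are placeholders, not real data):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical interaction log: (user_id, item_id, rating) triplets.
# In a real system these would come from a ratings table or event log.
users   = np.array([0, 0, 1, 2, 2, 2])
items   = np.array([0, 3, 1, 0, 2, 3])
ratings = np.array([5.0, 3.0, 4.0, 1.0, 4.0, 2.0])

m, n = 3, 4  # 3 users, 4 items
R = csr_matrix((ratings, (users, items)), shape=(m, n))

# Only 6 of the 12 entries are stored; the rest are "missing", not zero.
print(f"density: {R.nnz / (m * n):.2%}")   # -> density: 50.00%
print(R.toarray())
```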
The goal of CF is to fill in the missing entries, or equivalently to rank the unseen items for each user from most to least likely to be enjoyed. The algorithm does this by exploiting two regularities in R:

- Rows are redundant: users who have rated the same items similarly in the past tend to rate future items similarly.
- Columns are redundant: items that have been rated similarly by the same users tend to attract similar ratings from new users.
Neither regularity needs item descriptions or user profiles to work. CF is therefore considered a content-agnostic method, which is what distinguishes it from content-based filtering.
Collaborative filtering as a named idea is just over thirty years old, but its roots in social information filtering go back much further. The modern form began in the early 1990s and matured into a major academic and industrial subfield through three landmark events: the GroupLens project, Amazon's item-to-item algorithm, and the Netflix Prize.
The term collaborative filtering was coined by David Goldberg, David Nichols, Brian Oki, and Douglas Terry at Xerox PARC in their 1992 Communications of the ACM paper "Using collaborative filtering to weave an information tapestry." Tapestry was an experimental email and document-filtering system designed to handle the growing flood of internal Xerox mail and Usenet news. Users could annotate any document with reactions, for example marking a memo as "interesting" or replying to a thread. Other users wrote queries that referenced these annotations, such as "show me messages that Smith replied to." Tapestry required users to manually name the people whose opinions they trusted, which limited its scale, but it introduced the central idea that human reactions, not just document contents, can guide automated filtering.
In 1994, Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl, working at the University of Minnesota and MIT, introduced GroupLens at the ACM Conference on Computer Supported Cooperative Work. GroupLens applied automated CF to Usenet netnews, where the volume of articles already exceeded what any individual could read. News-reader clients displayed predicted scores for each article and let users rate articles after reading them. Servers called "Better Bit Bureaus" gathered ratings, computed user similarities (initially with Pearson correlation), and pushed predictions back to clients. GroupLens was the first system to demonstrate that CF could scale to open communities of strangers and the first to formalize the user-based neighborhood algorithm that became a standard textbook reference.
Almost simultaneously, Upendra Shardanand and Pattie Maes at the MIT Media Lab introduced Ringo (1995), a CF system for music recommendations. Ringo evolved into the commercial venture Firefly Network, which Microsoft acquired in 1998. By the late 1990s, CF was being explored by online retailers, news aggregators, and early social networking sites. Research labs at GroupLens, MIT Media Lab, and Compaq's Systems Research Center compared similarity functions, neighborhood sizes, and aggregation rules, producing the toolkit of memory-based methods still in use today.
Amazon was an early commercial adopter of collaborative filtering and is responsible for one of the most influential algorithmic changes in the field. In a 2003 IEEE Internet Computing paper, Greg Linden, Brent Smith, and Jeremy York described Amazon's switch from user-based CF to item-to-item collaborative filtering. Instead of finding users who looked like the active customer and recommending what those users had bought, Amazon precomputed an item-to-item similarity table offline, then at request time looked up similar items for everything in the customer's history. The change made recommendations much faster, scaled to tens of millions of customers, and produced higher-quality lists. In 2017 the IEEE Internet Computing editorial board named the paper the single best "test of time" article from its first twenty years of publication.
In October 2006, Netflix released a dataset of 100 million ratings and offered a one-million-dollar prize to anyone who could improve the RMSE of its Cinematch system by 10 percent. Over 40,000 teams from 186 countries entered. The competition popularized matrix factorization as the dominant CF approach: Simon Funk's blog post in December 2006 describing a stochastic gradient descent variant of singular value decomposition (now known as Funk SVD) reset the leaderboard and inspired hundreds of derivative models. Yehuda Koren's SVD++ added implicit feedback signals, and timeSVD++ added temporal dynamics.
The prize was awarded on 21 September 2009 to BellKor's Pragmatic Chaos, a merged team consisting of Yehuda Koren and Robert Bell of AT&T Labs, Chris Volinsky of AT&T Labs, Andreas Toscher and Michael Jahrer of Commendo Research, and Martin Piotte and Martin Chabbert of Pragmatic Theory. Their winning solution, with a test RMSE of 0.8567, was a blend of more than 100 individual models combining matrix factorization, restricted Boltzmann machines, and neighborhood methods. Netflix later said it never put the full ensemble into production because of engineering complexity, but the techniques and the open dataset transformed academic recommender systems research for a decade.
Collaborative filtering algorithms divide cleanly into two families. Memory-based methods keep the entire interaction matrix in memory and search it at prediction time. Model-based methods learn a compact predictive model offline and use it at prediction time. The same training data can be used by either family, and modern systems frequently combine both.
| Aspect | Memory-based CF | Model-based CF |
|---|---|---|
| Prediction approach | Searches raw user-item data at prediction time | Uses a trained model to score user-item pairs |
| Training phase | None (lazy learning) | Required (offline training) |
| Prediction speed | Slower for large catalogs | Fast at prediction time |
| Handling sparsity | Struggles with very sparse data | More robust to sparsity |
| Interpretability | High (neighbors are explainable) | Lower (latent factors are opaque) |
| Scalability | Limited by pairwise similarity computation | Scales to billions of interactions |
| Update cost | Trivial (just append data) | Requires retraining or incremental updates |
| Cold start handling | Poor without auxiliary signals | Poor without auxiliary signals |
| Typical algorithms | User-KNN, item-KNN | Matrix factorization, neural CF, autoencoders, GNNs |
Memory-based CF, also called neighborhood-based CF, predicts a user's preference for an item by aggregating preferences from a chosen neighborhood. The neighborhood is selected using a similarity measure over the rows or columns of the interaction matrix. There are two flavors, distinguished by whether the neighborhood is over users or over items.
User-based CF finds the set of users most similar to the active user, then predicts a rating for an unseen item as the similarity-weighted average of those neighbors' ratings. Formally, the predicted rating $\hat{r}_{u,i}$ for user $u$ and item $i$ is:

$$\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v \in N(u,i)} \mathrm{sim}(u,v)\,\bigl(r_{v,i} - \bar{r}_v\bigr)}{\sum_{v \in N(u,i)} \lvert \mathrm{sim}(u,v) \rvert}$$

where $N(u,i)$ is the neighborhood of users similar to $u$ who have rated $i$, and $\bar{r}_u$ is $u$'s average rating. Subtracting the mean corrects for the fact that some users habitually give higher or lower ratings than others.
User-based CF is intuitive and gives explanations like "users who rated these films the way you did also liked Item X," which is easy to display in a UI. Its weakness is computational: pairwise user similarity scales as O(m^2), and the user neighborhood drifts every time anyone rates a new item, so similarity needs frequent recomputation in production.
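A minimal sketch of this prediction rule on a toy dense matrix, with `np.nan` marking missing ratings; a production system would use sparse storage, precomputed neighborhoods, and a vectorized similarity computation:

```python
import numpy as np

def predict_user_based(R, u, i, k=2):
    """Mean-centered user-based CF prediction for user u, item i.

    R is a dense m x n array with np.nan for missing ratings.
    Illustrative only: recomputes means and similarities on every call.
    """
    mask = ~np.isnan(R)
    means = np.array([R[v][mask[v]].mean() for v in range(R.shape[0])])

    # Pearson similarity between u and every user who has rated item i,
    # computed over the items the two users have co-rated.
    sims = []
    for v in range(R.shape[0]):
        if v == u or np.isnan(R[v, i]):
            continue
        common = mask[u] & mask[v]
        if common.sum() < 2:
            continue
        a, b = R[u, common] - means[u], R[v, common] - means[v]
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom > 0:
            sims.append((a @ b / denom, v))

    neighbors = sorted(sims, reverse=True)[:k]   # top-k most similar raters of i
    if not neighbors:
        return means[u]                          # fall back to the user's mean
    num = sum(s * (R[v, i] - means[v]) for s, v in neighbors)
    den = sum(abs(s) for s, _ in neighbors)
    return means[u] + num / den

R = np.array([[5, 3, np.nan, 1],
              [4, np.nan, np.nan, 1],
              [1, 1, np.nan, 5],
              [1, np.nan, 5, 4]], dtype=float)
print(predict_user_based(R, u=0, i=2))
```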
Item-based CF reverses the geometry. Instead of looking for similar users, the system precomputes an item-item similarity table and predicts ratings using items already seen by the user as the neighborhood. The predicted rating becomes:
$$\hat{r}_{u,i} = \frac{\sum_{j \in N(i,u)} \mathrm{sim}(i,j)\, r_{u,j}}{\sum_{j \in N(i,u)} \lvert \mathrm{sim}(i,j) \rvert}$$

where $N(i,u)$ is the set of items rated by $u$ that are most similar to target item $i$. The advantages over user-based CF were quantified by Sarwar, Karypis, Konstan, and Riedl in their 2001 WWW paper. Item-item similarities change much less often than user-user similarities, since rating one new movie barely affects the similarity profile of a movie with millions of ratings. The similarity table can therefore be computed offline and reused, making the online prediction step extremely fast. Linden, Smith, and York's 2003 paper showed that item-based CF was the technology behind Amazon's iconic "Customers who bought this also bought" widget and could serve recommendations for tens of millions of users at sub-second latency.
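A sketch of the two phases on a toy matrix, assuming cosine similarity and treating 0 as "missing" (a common simplification, especially for implicit-style data):

```python
import numpy as np

# --- Offline: precompute the item-item cosine similarity table. ---
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 5, 4]], dtype=float)

norms = np.linalg.norm(R, axis=0, keepdims=True)
norms[norms == 0] = 1.0                       # avoid division by zero
S = (R / norms).T @ (R / norms)               # n x n cosine similarity
np.fill_diagonal(S, 0.0)                      # an item is not its own neighbor

# --- Online: score item i for user u from the items u has rated. ---
def predict_item_based(u, i, k=2):
    rated = np.nonzero(R[u])[0]               # items user u has interacted with
    top = rated[np.argsort(S[i, rated])[::-1][:k]]
    den = np.abs(S[i, top]).sum()
    return S[i, top] @ R[u, top] / den if den > 0 else 0.0

print(predict_item_based(u=1, i=2))
```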
Model-based CF learns a parametric predictive model that captures the regularities in the interaction matrix. At prediction time, the system applies the model rather than searching the raw data. Model-based methods are more robust to sparsity, generally faster at serving, and easier to combine with side information, but they require an offline training stage and the learned representations are typically less interpretable than nearest-neighbor lists.
The dominant family of model-based methods is matrix factorization, which factors the interaction matrix into a low-rank product of a user matrix and an item matrix. Other classical model-based approaches include clustering-based CF, Bayesian networks, and restricted Boltzmann machines (which were a key ingredient of the Netflix Prize winning blend). Modern model-based CF is dominated by neural network architectures, including neural collaborative filtering, autoencoders, two-tower models, graph neural networks, and transformer-based sequential models.
Both user-based and item-based CF depend on a similarity measure to quantify how alike two rating vectors are. The choice of metric affects recommendation quality significantly, especially under sparsity.
| Metric | Definition | Range | Common use |
|---|---|---|---|
| Cosine similarity | Cosine of the angle between two rating vectors, treated as vectors in an item-space (or user-space) | -1 to 1 | Default for item-based CF; works well with implicit feedback |
| Pearson correlation | Linear correlation between two users' ratings, with each rating centered on the user's mean | -1 to 1 | Default for user-based CF; corrects for user-specific rating biases |
| Adjusted cosine similarity | Cosine similarity computed after subtracting each user's mean rating from each rating | -1 to 1 | Item-based CF when raters use the rating scale differently |
| Jaccard index | Size of intersection over size of union of two users' interacted-item sets | 0 to 1 | Implicit feedback or binary interactions |
| Spearman rank correlation | Pearson correlation applied to rating ranks rather than raw scores | -1 to 1 | When rating distributions are skewed or non-linear |
| Euclidean distance | Straight-line distance between two rating vectors | 0 to infinity | General-purpose; rarely the best CF choice |
In practice, cosine and adjusted cosine are the workhorse metrics for item-based CF, while Pearson dominates user-based CF on explicit ratings. With implicit feedback, the Jaccard index and conditional probability ratios (such as Amazon's "customers who bought X also bought Y" formula) are common.
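For reference, here is a minimal sketch of three of these metrics on invented toy vectors; in practice the rating-based metrics are computed only over the items two users have co-rated:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def pearson(a, b):
    # Center each vector on its own mean, then take the cosine.
    return cosine(a - a.mean(), b - b.mean())

def jaccard(set_a, set_b):
    return len(set_a & set_b) / len(set_a | set_b)

u = np.array([5.0, 4.0, 1.0, 2.0])   # two users' ratings on the same 4 items
v = np.array([4.0, 5.0, 2.0, 1.0])
print(f"cosine : {cosine(u, v):.3f}")   # high: vectors point the same way
print(f"pearson: {pearson(u, v):.3f}")  # also high after mean-centering
print(f"jaccard: {jaccard({1, 2, 3}, {2, 3, 4}):.3f}")  # 2/4 = 0.5
```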
Matrix factorization is the most influential model-based approach in CF. The central idea is to decompose the sparse user-item rating matrix R into the product of two lower-dimensional dense matrices: a user matrix U of size m by k and an item matrix V of size n by k, where k (the number of latent factors) is much smaller than m or n. The predicted rating for user u and item i is the dot product:
$$\hat{r}_{u,i} = U_u \cdot V_i = \sum_{f=1}^{k} U_{u,f}\, V_{i,f}$$

where $U_u$ and $V_i$ are the factor rows for user $u$ and item $i$.
Fitting U and V to the observed ratings via regularized squared error reduces the recommendation problem to an optimization that, while non-convex jointly, is convex in either factor matrix when the other is held fixed. Matrix factorization is treated in detail in the dedicated matrix factorization article. The summary below covers the variants most relevant to CF.
Simon Funk's blog post during the Netflix Prize popularized a stochastic gradient descent procedure that updates user and item factors one observed rating at a time, ignoring the millions of missing entries entirely. Despite its name, Funk SVD is not a true singular value decomposition (SVD requires a complete matrix) but a regularized matrix factorization optimized with SGD. The method was simple, fast, and dramatically more accurate than the neighborhood baselines. Funk's post put him in third place on the leaderboard at the time, and his approach became the basis for nearly every later matrix factorization variant.
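A minimal sketch of the SGD training loop in this spirit, omitting the bias terms that most production variants add (the toy triples and hyperparameters are illustrative):

```python
import random
import numpy as np

def funk_svd(triples, m, n, k=20, lr=0.02, reg=0.02, epochs=200, seed=0):
    """SGD matrix factorization in the spirit of Funk's approach.

    Trains only on observed (user, item, rating) triples, ignoring
    missing entries entirely.
    """
    rng = np.random.default_rng(seed)
    U = rng.normal(0, 0.1, (m, k))
    V = rng.normal(0, 0.1, (n, k))
    shuffler = random.Random(seed)
    for _ in range(epochs):
        shuffler.shuffle(triples)
        for u, i, r in triples:
            err = r - U[u] @ V[i]            # error on this one rating
            u_old = U[u].copy()              # update both factors from old values
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * u_old - reg * V[i])
    return U, V

triples = [(0, 0, 5.0), (0, 3, 1.0), (1, 0, 4.0),
           (2, 1, 1.0), (2, 3, 5.0), (3, 2, 5.0)]
U, V = funk_svd(triples, m=4, n=4)
print(f"predicted r(0, 0): {U[0] @ V[0]:.2f}")   # converges toward 5
```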
Yehuda Koren's SVD++ extends Funk SVD by adding a second per-item factor that captures the influence of which items a user has rated, even if the rating value is unknown. This implicit signal often improves accuracy on top of the explicit signal alone. timeSVD++ goes a step further and lets user biases, item biases, and user factors drift over time, which captures effects such as users' tastes shifting and Netflix periodically re-calibrating its star scale. Both variants were core components of the BellKor and BellKor's Pragmatic Chaos solutions.
Non-negative matrix factorization (NMF) imposes the constraint that all entries of U and V are non-negative. The constraint produces parts-based representations that are easier to interpret, since a learned factor cannot subtract from another. NMF is well suited to non-negative data such as count matrices and is used in topic modeling and image decomposition as well as CF.
Probabilistic matrix factorization (Salakhutdinov and Mnih, 2007) gives matrix factorization a Bayesian formulation in which user and item factors have Gaussian priors and ratings are conditional Gaussians given the dot product. PMF generalizes well on sparse data and provides a principled way to add user-item side information through the priors. Bayesian PMF (BPMF), introduced the following year, removes the need to set hyperparameters by hand by sampling them with MCMC.
Alternating least squares (ALS) optimizes matrix factorization by alternately fixing one factor matrix and solving for the other. With one matrix held constant, the loss is convex in the other and reduces to a regularized least-squares problem with a closed-form solution per row. The procedure is highly parallel: every user row update is independent, and every item row update is independent, so ALS scales naturally on distributed engines such as Apache Spark MLlib.
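A sketch of the alternating closed-form updates on dense toy data; real implementations (such as Spark's) use sparse storage and distribute the independent row solves:

```python
import numpy as np

def als_explicit(R, mask, k=10, reg=0.1, iters=15):
    """Alternating least squares on an explicit-feedback matrix.

    R: m x n ratings (values irrelevant where mask == 0); mask: 1 where
    observed. Each row update is a small regularized least-squares solve.
    """
    m, n = R.shape
    rng = np.random.default_rng(0)
    U = rng.normal(0, 0.1, (m, k))
    V = rng.normal(0, 0.1, (n, k))
    I = reg * np.eye(k)
    for _ in range(iters):
        for u in range(m):                  # fix V, solve for each user row
            obs = mask[u].astype(bool)
            Vo = V[obs]
            U[u] = np.linalg.solve(Vo.T @ Vo + I, Vo.T @ R[u, obs])
        for i in range(n):                  # fix U, solve for each item row
            obs = mask[:, i].astype(bool)
            Uo = U[obs]
            V[i] = np.linalg.solve(Uo.T @ Uo + I, Uo.T @ R[obs, i])
    return U, V

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 5, 4]], dtype=float)
mask = (R > 0).astype(float)
U, V = als_explicit(R, mask)
print(np.round(U @ V.T, 1))   # reconstruction with missing entries filled in
```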
Hu, Koren, and Volinsky's 2008 ICDM paper extended ALS to implicit feedback. Their key insight is that for implicit data, every user-item pair is observed: an interaction means "some preference, some confidence" and a non-interaction means "some preference (probably negative), low confidence." The model treats the binary preference p and a confidence weight c separately, then minimizes the confidence-weighted squared error over the full m by n matrix using ALS with closed-form linear-algebra tricks that exploit the low rank structure. The algorithm became the default factorization for implicit data and is the basis for the open-source implicit Python library and the implicit mode of Spark ALS. The paper won the IEEE ICDM 10-Year Highest-Impact Paper Award in 2017.
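A sketch of the single-user closed-form update from the paper, using the standard linear confidence $c_{ui} = 1 + \alpha r_{ui}$ and the identity $Y^\top C^u Y = Y^\top Y + Y^\top (C^u - I) Y$, which makes the solve cheap because $(C^u - I)$ is nonzero only on interacted items (variable names here are illustrative):

```python
import numpy as np

def implicit_als_user_update(Y, r_u, alpha=40.0, reg=0.1):
    """One user-factor solve in the style of Hu, Koren & Volinsky (2008).

    Y: n x k item factors; r_u: this user's raw interaction counts.
    Real implementations precompute Y.T @ Y once per sweep, since it is
    shared across all users.
    """
    n, k = Y.shape
    p_u = (r_u > 0).astype(float)          # binary preference
    c_u = 1.0 + alpha * r_u                # confidence grows with count
    YtY = Y.T @ Y                          # user-independent term
    nz = np.nonzero(r_u)[0]                # only interacted items matter below
    A = YtY + Y[nz].T @ ((c_u[nz] - 1.0)[:, None] * Y[nz]) + reg * np.eye(k)
    b = Y[nz].T @ (c_u[nz] * p_u[nz])
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
Y = rng.normal(0, 0.1, (1000, 32))                     # 1000 items, 32 factors
r_u = np.zeros(1000); r_u[[3, 17, 256]] = [2, 1, 7]    # three interactions
print(implicit_als_user_update(Y, r_u).shape)          # (32,)
```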
Collaborative filtering systems differ in the type of feedback they consume.
Explicit feedback is direct, intentional input from users: star ratings, thumbs-up or thumbs-down, written reviews. It expresses preference clearly, but it is sparse: most users rate only a small fraction of items they interact with, and the act of rating itself selects for unusual reactions (very good or very bad).
Implicit feedback is inferred from observed behavior: clicks, watches, purchases, dwell time, plays, scroll depth, repeat visits. It is far more abundant than explicit feedback. Its weakness is interpretation: viewing a product page does not necessarily indicate a positive preference, and the absence of an interaction does not reliably indicate disinterest (the user may not have seen the item).
| Feedback type | Examples | Advantages | Disadvantages |
|---|---|---|---|
| Explicit | Star ratings, likes, written reviews, thumbs up or down | Clear preference signal; easy to model | Very sparse; requires user effort; selection bias |
| Implicit | Clicks, purchases, watch time, listen counts, dwell time, scroll depth | Abundant; no extra user effort; richer behavioral coverage | Noisy; non-interaction is ambiguous; biased by exposure |
Most large production recommenders today rely primarily on implicit feedback because of its abundance. Hu, Koren, and Volinsky's 2008 weighted-confidence ALS framework gave the field its first principled implicit-feedback factorization. Many systems still mix in explicit signals where available to disambiguate cases such as a long click followed by a quick close (negative implicit signal but a positive click).
The rise of deep learning after 2014 brought several neural architectures into the CF toolkit. The motivation was that classic matrix factorization assumes a linear interaction (a dot product) between user and item factors, while real preferences may depend non-linearly on combinations of factors and on side information that is hard to slot into a linear model.
Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua's 2017 paper at WWW introduced neural collaborative filtering (NCF). NCF replaces the dot product of matrix factorization with a feedforward neural network that learns an arbitrary user-item interaction function from data. The paper proposed three model variants: Generalized Matrix Factorization (GMF), which generalizes the standard inner product; Multi-Layer Perceptron (MLP), which feeds concatenated user and item embeddings into a stack of fully connected layers; and NeuMF, which fuses GMF and MLP and trains them jointly. NCF was evaluated on the MovieLens 1M and Pinterest datasets and significantly outperformed traditional MF on Hit Rate at 10 and NDCG at 10. NCF also normalized the practice of training CF models on implicit feedback with binary cross-entropy loss and negative sampling, which became standard.
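As an illustration, here is a sketch of the MLP branch of this idea in PyTorch; it is not the paper's exact NeuMF architecture (which additionally fuses a GMF branch), and the layer sizes and toy batch are placeholders:

```python
import torch
import torch.nn as nn

class MLPRecommender(nn.Module):
    """NCF-style MLP branch: learned embeddings fed through an MLP."""
    def __init__(self, n_users, n_items, dim=32):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, users, items):
        x = torch.cat([self.user_emb(users), self.item_emb(items)], dim=-1)
        return self.mlp(x).squeeze(-1)   # logit; sigmoid applied in the loss

model = MLPRecommender(n_users=1000, n_items=5000)
loss_fn = nn.BCEWithLogitsLoss()

# One training step on a toy batch: observed pairs are positives (label 1),
# randomly sampled unobserved pairs are negatives (label 0).
users = torch.tensor([0, 0, 7])
items = torch.tensor([12, 401, 33])
labels = torch.tensor([1.0, 0.0, 1.0])
loss = loss_fn(model(users, items), labels)
loss.backward()
print(float(loss))
```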
A 2020 SIGIR paper by Rendle, Krichene, Zhang, and Anderson re-examined NCF and showed that with careful tuning, a regularized matrix factorization baseline could match or beat NeuMF on the same benchmarks. The exchange highlighted how brittle published CF benchmarks can be and contributed to the community's increased focus on rigorous evaluation.
Autoencoder-based CF treats each user as a sparse vector in item-space and trains an autoencoder to reconstruct it. AutoRec (Sedhain, Menon, Sanner, and Xie, 2015) was the first such model. CDAE (Wu, DuBois, Zheng, and Ester, 2016) added denoising. Mult-VAE (Liang, Krishnan, Hoffman, and Jebara, 2018) used a variational autoencoder with a multinomial likelihood and showed strong performance on MovieLens-20M and Netflix data. Autoencoder CF tends to outperform traditional MF on top-N recommendation, particularly on implicit data, because the multinomial likelihood is a better match for click counts than squared error.
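A sketch of the AutoRec idea: reconstruct each user's row of the interaction matrix and take the loss only over observed entries (sizes and toy data are placeholders):

```python
import torch
import torch.nn as nn

class UserAutoRec(nn.Module):
    """AutoRec-style autoencoder over a user's item vector."""
    def __init__(self, n_items, hidden=256):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(n_items, hidden), nn.Sigmoid())
        self.decode = nn.Linear(hidden, n_items)

    def forward(self, r):
        return self.decode(self.encode(r))

n_items = 5000
model = UserAutoRec(n_items)
r = torch.zeros(1, n_items)
r[0, [3, 42, 77]] = torch.tensor([5.0, 3.0, 4.0])      # one user's ratings
mask = (r > 0).float()
pred = model(r)
loss = (((pred - r) * mask) ** 2).sum() / mask.sum()   # observed entries only
loss.backward()
```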
Large-scale industry recommenders often need to select a few thousand candidates from a corpus of hundreds of millions of items in milliseconds. The dominant architecture for this candidate-generation step is the two-tower model: a user tower encodes user features and recent history into a vector, an item tower encodes item features into a vector, and a dot product (or cosine similarity) between the two vectors scores each user-item pair. Item vectors are precomputed and indexed for approximate nearest neighbor retrieval. The user vector is computed at request time and the system retrieves the K nearest item vectors.
The seminal industrial paper is Covington, Adams, and Sargin's 2016 RecSys paper "Deep Neural Networks for YouTube Recommendations," which described a two-stage architecture (deep candidate generation followed by deep ranking) serving billions of users. Yi et al.'s 2019 RecSys paper "Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations" extended the two-tower retrieval approach to YouTube's tens of millions of videos and introduced an in-batch sampling correction that adjusts for the over-representation of popular items in random batches. The paper's streaming-data frequency estimation algorithm is now standard in YouTube's neural retrieval system.
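A sketch of the two-tower pattern with in-batch softmax negatives, in the spirit of these papers; the feature widths, layer sizes, and random inputs are placeholders:

```python
import torch
import torch.nn as nn

class Tower(nn.Module):
    """One tower: feature vector in, L2-normalized embedding out."""
    def __init__(self, in_dim, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

    def forward(self, x):
        return nn.functional.normalize(self.net(x), dim=-1)

user_tower = Tower(in_dim=40)   # hypothetical user feature width
item_tower = Tower(in_dim=80)   # hypothetical item feature width

# Training: in-batch softmax, where each item in the batch serves as a
# negative for every other user's positive pair.
users = torch.randn(256, 40)
items = torch.randn(256, 80)
logits = user_tower(users) @ item_tower(items).T   # 256 x 256 scores
labels = torch.arange(256)                          # diagonal = positives
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()

# Serving: item embeddings are precomputed into an ANN index; the user
# embedding is computed per request and the top-K items are retrieved.
```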
Graph neural networks treat the user-item interaction matrix as a bipartite graph and learn representations by propagating information along edges. The first widely deployed industrial example was PinSage at Pinterest (Ying, He, Chen, Eksombatchai, Hamilton, Leskovec, and Rosenberg, 2018), trained on a graph of 3 billion pins and boards with 18 billion edges. PinSage uses random walks to sample importance-weighted neighborhoods and produces embeddings that combine graph structure with pin features (text, image, and category metadata).
Research GNN-CF models include NGCF (Wang, He, Wang, Feng, and Chua, 2019), which propagates user and item embeddings through several graph convolution layers, and LightGCN (He, Deng, Wang, Li, Zhang, and Wang, 2020), which simplifies NGCF by removing the feature transformation and non-linear activation and keeping only the neighborhood aggregation step. Despite its simplicity, LightGCN often matches or beats more complex GNN architectures on standard CF benchmarks and has become a strong baseline in academic papers.
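A sketch of LightGCN's propagation rule on a toy binary matrix; training of the layer-0 embeddings (in the paper, with a BPR loss) is omitted:

```python
import numpy as np

def lightgcn_embeddings(R, dim=16, layers=3, seed=0):
    """LightGCN-style propagation on the user-item bipartite graph.

    R: binary m x n interaction matrix. Embeddings are propagated with
    the symmetrically normalized adjacency and averaged across layers;
    there are no feature transforms or non-linearities, which is exactly
    the simplification LightGCN makes.
    """
    m, n = R.shape
    A = np.block([[np.zeros((m, m)), R],
                  [R.T, np.zeros((n, n))]])        # bipartite adjacency
    d = A.sum(axis=1)
    d[d == 0] = 1.0
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A @ D_inv_sqrt             # D^-1/2 A D^-1/2

    rng = np.random.default_rng(seed)
    E = rng.normal(0, 0.1, (m + n, dim))            # layer-0 embeddings
    out = E.copy()
    for _ in range(layers):
        E = A_hat @ E                               # pure neighborhood aggregation
        out += E
    out /= layers + 1                               # mean over layers
    return out[:m], out[m:]                         # user, item embeddings

R = (np.random.default_rng(1).random((6, 8)) < 0.3).astype(float)
U, V = lightgcn_embeddings(R)
scores = U @ V.T                                    # rank items per user
```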
Sequential recommenders model the time-ordered sequence of items each user has interacted with and predict the next item. They overlap heavily with CF: the training signal still comes from observed user-item interactions, but the model learns transitions and conditional patterns rather than static user representations.
GRU4Rec (Hidasi, Karatzoglou, Baltrunas, and Tikk, 2016) used a gated recurrent network and was the first widely cited deep sequential recommender. SASRec (Kang and McAuley, 2018) replaced the RNN with a unidirectional transformer (similar to a small GPT) and shifted the field toward self-attention. BERT4Rec (Sun, Liu, Wu, Pei, Lin, Ou, and Jiang, 2019) used bidirectional self-attention with a masked-item Cloze objective, modeled on BERT, and became a strong sequential CF baseline. Industrial recommenders at TikTok, Meta, Pinterest, and YouTube use heavily customized variants of these transformer architectures, often with hundreds of millions of parameters.
Collaborative filtering and content-based filtering are the two classical paradigms for recommendation, and they have complementary strengths.
| Aspect | Collaborative filtering | Content-based filtering |
|---|---|---|
| Required input | User-item interaction matrix | Item attribute features (text, tags, images, metadata) |
| Required user data | Interaction history with multiple items | History plus item attribute preferences |
| New user (cold start) | Poor, no history to leverage | Possible if user provides preferences |
| New item (cold start) | Cannot recommend (no interactions yet) | Works as soon as features exist |
| Discovery quality | High (can find unexpected cross-genre items) | Limited to items similar to past consumption |
| Domain transfer | Works without domain-specific features | Requires per-domain feature engineering |
| Popularity bias | Strong | Weaker |
| Niche items | Underrepresented | Better represented |
| Filter bubble risk | Higher | Lower |
| Best for | Large catalogs with rich interaction data | Domains with rich item descriptions and limited interactions |
In practice, modern production systems are almost always hybrid: they combine CF with content features and often with knowledge graphs, sequential signals, and contextual features such as time and device. Hybridization mitigates the cold start problem of pure CF, the over-specialization of pure content-based filtering, and the popularity bias of either method used alone.
The cold start problem is the central operational headache of CF. It manifests in three forms.
User cold start. A new user has no interaction history, so neighborhood methods cannot find similar users and matrix factorization cannot learn the user's latent vector. Common fixes include onboarding flows that ask the user to rate or pick from a curated set of popular items, default models built from demographic signals, and content-based fallbacks until enough interactions accrue.
Item cold start. A new item has no interactions, so it cannot appear in any neighborhood and cannot be ranked by a CF model. Common fixes include hybrid models that incorporate item features (text, image embeddings, categorical metadata) into the item tower, an explicit "new items" injection bucket, and early exploration in candidate generation.
System cold start. A brand new platform has no interactions at all. It typically launches with non-personalized popularity baselines, editorial curation, or content-based recommendations until usage accrues enough data for CF to take over.
Research on cold start mitigation has produced a long line of techniques: dropout-based augmentation in autoencoders, meta-learning approaches that adapt quickly from a few interactions, contrastive pre-training on rich item features, and large-language-model-driven cold start that uses textual item descriptions to bootstrap embeddings.
Data sparsity. Real interaction matrices are typically 95 to 99.99 percent empty. Sparsity makes neighborhood selection unreliable and leaves matrix factorization underdetermined for users and items with few observations. Dimensionality reduction, regularization, and side information all help.
Scalability. With hundreds of millions of users and items, exhaustive pairwise computation is infeasible. Solutions include approximate nearest neighbor (ANN) search with locality-sensitive hashing (LSH) or Hierarchical Navigable Small World (HNSW) indexes, distributed training on Spark or parameter servers, sub-linear sampling tricks during training, and GPU acceleration.
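As an illustration of the ANN serving path, here is a sketch using FAISS's HNSW index; the vectors are random placeholders standing in for precomputed item embeddings:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 64
item_vecs = np.random.rand(100_000, d).astype("float32")
faiss.normalize_L2(item_vecs)             # cosine similarity via inner product

index = faiss.IndexHNSWFlat(d, 32)        # HNSW graph index, M=32 links/node
index.add(item_vecs)

user_vec = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(user_vec)
scores, ids = index.search(user_vec, 10)  # top-10 nearest items
print(ids[0])
```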
Popularity bias. CF inherently amplifies popular items because they appear in more user histories and accumulate more rating signal. The result is a feedback loop that under-recommends niche items. Mitigations include inverse propensity scoring, item-frequency penalties (the sampling correction in YouTube's two-tower model is one example), re-ranking for diversity, and explicit exploration mechanisms.
Filter bubbles and over-specialization. When CF works well it tends to produce items very similar to what a user already consumes, narrowing the feed over time. Diversity-aware re-ranking and serendipity metrics try to compensate.
Gray sheep. Users with idiosyncratic preferences that do not align with any neighborhood receive poor recommendations. There is no general solution; the standard mitigation is to fall back to content-based recommendations and to surface popular items.
Privacy. A CF model trained on detailed interaction logs encodes information about individual users. Membership inference attacks can determine whether a particular user was in the training set. Federated CF (for example FedNCF) keeps interaction data on user devices and shares only model gradients, and differentially private CF adds calibrated noise to the gradients to bound information leakage.
Adversarial robustness and shilling attacks. Because CF treats every user equally, attackers can inject fake user accounts that rate target items in coordinated ways to push them up or down in recommendations. The behavior is called a shilling attack or profile injection attack (the term "shill" comes from a fake bidder placed at an auction). A 2009 IEEE survey by Mobasher and Burke covered the early literature; later work has explored detection (clustering anomalies in the rating distribution), robust similarity functions, and adversarial training. Modern defenses combine behavioral fraud detection with model-level robustness.
Evaluation in CF splits along the same line as the algorithms themselves. Rating prediction is judged by accuracy of the predicted score, while top-N recommendation is judged by quality of the ranked list.
| Metric | Type | Description |
|---|---|---|
| RMSE (root mean squared error) | Rating accuracy | Square root of the mean squared error between predicted and actual ratings; was the official Netflix Prize metric |
| MAE (mean absolute error) | Rating accuracy | Mean absolute error between predicted and actual ratings; less sensitive to outliers than RMSE |
| Precision at K | Ranking quality | Fraction of the top K recommendations that are relevant |
| Recall at K | Ranking quality | Fraction of all relevant items that appear in the top K recommendations |
| Hit Rate at K | Ranking quality | Fraction of users for whom at least one held-out relevant item appears in the top K |
| NDCG at K | Ranking quality | Normalized discounted cumulative gain; rewards placing highly relevant items near the top, with a logarithmic position discount |
| MAP (mean average precision) | Ranking quality | Mean over users of average precision at every relevant rank |
| MRR (mean reciprocal rank) | Ranking quality | Mean of 1 over the rank of the first relevant item in each user's list |
| AUC (area under ROC) | Ranking quality | Probability that a random positive item is ranked above a random negative item |
| Coverage | Catalog quality | Fraction of the catalog that ever appears in any user's recommendations |
| Diversity | List quality | Average pairwise dissimilarity of items in a recommendation list |
| Serendipity | Beyond-accuracy | Frequency of unexpected but relevant recommendations |
| CTR (click-through rate) | Online | Fraction of impressions clicked |
| Watch time, conversion | Online | Aggregate engagement signals measured in live A/B tests |
RMSE was the headline metric of the Netflix Prize era and remains the default for rating-prediction CF, but the field has largely moved on to ranking metrics (NDCG at 10 and Recall at 20 are the most common in academic CF papers since 2017). Industrial systems treat offline metrics as a screening filter and use online A/B tests on CTR, watch time, conversion, and longer-term retention as the ground truth.
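For concreteness, a sketch of three of the ranking metrics for one user's recommendation list, assuming binary relevance (the toy list and held-out positives are invented):

```python
import numpy as np

def precision_recall_at_k(ranked, relevant, k):
    hits = len(set(ranked[:k]) & relevant)
    return hits / k, hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    # Logarithmic position discount; ranks are 0-indexed, hence r + 2.
    dcg = sum(1.0 / np.log2(r + 2)
              for r, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / np.log2(r + 2) for r in range(min(k, len(relevant))))
    return dcg / ideal

ranked = [10, 4, 7, 99, 3]                  # model's top-5 for one user
relevant = {4, 3, 50}                       # held-out positives
p, r = precision_recall_at_k(ranked, relevant, k=5)
print(f"P@5={p:.2f} R@5={r:.2f} NDCG@5={ndcg_at_k(ranked, relevant, 5):.2f}")
```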
Collaborative filtering research depends heavily on a small set of public benchmark datasets.
| Dataset | Domain | Size | Notes |
|---|---|---|---|
| MovieLens 100K | Movies | 100,000 ratings, 943 users, 1,682 movies | Long-standing teaching dataset from GroupLens |
| MovieLens 1M | Movies | 1 million ratings, 6,000 users, 4,000 movies | Released February 2003; very common benchmark |
| MovieLens 10M | Movies | 10 million ratings, 72,000 users, 10,000 movies | Released January 2009 |
| MovieLens 25M | Movies | 25 million ratings, 162,000 users, 62,000 movies | Includes a tag genome with 1,129 tags |
| Netflix Prize | Movies | 100 million ratings, 480,000 users, 17,770 movies | Released October 2006; withdrawn after a privacy lawsuit |
| Amazon Reviews | Products | 230 million reviews across 24 product categories | Maintained by Julian McAuley's lab at UCSD |
| Yelp Open Dataset | Local businesses | About 7 million reviews of 150,000 businesses | Refreshed periodically by Yelp |
| Last.fm | Music | Millions of play counts; the LFM-1B and LFM-2B variants are widely used | Implicit feedback in play counts |
| Goodreads | Books | About 230 million interactions and 2 million books | Released by McAuley et al. |
| Steam | Video games | Play time and reviews for millions of users | Implicit feedback friendly |
| H&M Personalized Fashion | Apparel | Two years of customer transactions | Released for a 2022 Kaggle competition |
| Twitch | Live-streaming | Watch sessions with timestamps | Sequential CF benchmark |
MovieLens remains the default sanity-check benchmark for new CF methods because the data is small enough to iterate on quickly, well-formed (one rating per user-movie pair), and decades old, which gives a deep comparison literature. Industrial systems are evaluated mostly on private logs, not these public sets.
Collaborative filtering is deployed across nearly every consumer-facing platform with a large content catalog.
| Platform | What CF powers | Notable details |
|---|---|---|
| Amazon | Product recommendations on the home page, product pages, and email campaigns | Pioneered item-to-item CF (Linden et al., 2003); recommendations reportedly drive a large share of revenue |
| Netflix | Title row personalization, similar-titles, autoplay | Funded the Netflix Prize; production stack moved from CineMatch through MF to deep learning |
| YouTube | Home feed, watch-next, search ranking | Two-tower neural retrieval (Yi et al., 2019) plus deep ranking (Covington et al., 2016) |
| Spotify | Discover Weekly, Daily Mix, Release Radar, similar artists | Combines CF on listening data with NLP on web text and audio embeddings |
| TikTok | The For You feed | Heavily sequential CF with transformer-style models on engagement signals |
| Related Pins, Home feed | PinSage GNN (Ying et al., 2018) at the candidate-generation stage | |
| Instagram and Facebook | Reels, Explore, suggested posts and accounts | Two-tower retrieval and large neural rankers; Meta's open-source DLRM (Naumov et al., 2019) was an early industrial deep CF model |
| People You May Know, job recommendations, feed ranking | CF blended with employment graph features | |
| Twitter / X | Recommended Tweets and accounts | Two-tower retrieval with graph signals; partial open source via the heavy-ranker repo |
| Apple Music and Pandora | Personalized stations and recommendations | Pandora started from content-based (Music Genome Project) and added CF later |
| Goodreads | Book recommendations | CF on rating and shelf data |
| Steam | Game recommendations | CF on play time and purchase history |
| Google News | Article recommendations | CF on click and read time across the user base |
In each case CF is one component of a larger system that includes business rules, freshness signals, ad targeting, and editorial curation. The actual algorithms behind these production systems evolve constantly and are usually only partially disclosed in research papers.
A mature ecosystem of open-source libraries implements the algorithms above.
| Library | Language | Focus |
|---|---|---|
| Surprise | Python | Classic CF algorithms (KNN, SVD, NMF, SlopeOne); scikit-learn-inspired API; great for teaching and small-scale experimentation |
| implicit | Python | Optimized C and Cython implementations of ALS, BPR, and LMF for implicit-feedback CF |
| LightFM | Python | Hybrid CF with WARP and BPR ranking losses; mixes item features with collaborative signal |
| Spark MLlib ALS | Scala / Java / Python | Distributed ALS for explicit and implicit feedback; production workhorse on big data clusters |
| TensorFlow Recommenders (TFRS) | Python | Modern two-tower retrieval, ranking, and full deep-CF stack on TensorFlow |
| TorchRec | Python | Meta's distributed embedding library for very large CF models on PyTorch |
| RecBole | Python | Comprehensive benchmark suite covering 80+ recent CF and sequential models |
| Microsoft Recommenders | Python | End-to-end examples and best-practice notebooks for CF, hybrid, and ranking tasks |
| Cornac | Python | Multimodal CF library that mixes text, images, and graphs with classic CF |
| RecPack | Python | Modular framework for evaluating top-N CF |
| Vowpal Wabbit | C++ / Python | Online learning of large-scale CF with quadratic features and contextual bandits |
| FAISS, ScaNN, HNSWlib | C++ / Python | Approximate nearest neighbor libraries used to serve two-tower CF embeddings |
For research, RecBole and Microsoft Recommenders are the most common starting points. For production at small to mid scale, the implicit library and Spark MLlib ALS are widely used. For large-scale neural CF, TFRS and TorchRec are the dominant frameworks, paired with FAISS or ScaNN for serving.
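As a taste of the small-scale end of this ecosystem, here is a sketch using Surprise (installed as scikit-surprise) to cross-validate an SVD model on MovieLens 100K; the hyperparameters are illustrative, and the library offers to download the dataset on first use:

```python
from surprise import SVD, Dataset
from surprise.model_selection import cross_validate

data = Dataset.load_builtin("ml-100k")               # GroupLens teaching dataset
algo = SVD(n_factors=100, n_epochs=20, lr_all=0.005, reg_all=0.02)
cross_validate(algo, data, measures=["RMSE", "MAE"], cv=5, verbose=True)
```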
Collaborative filtering is no longer a standalone subfield in the way it was in the Netflix Prize era. Three big trends define the current landscape.
LLM-based and generative recommendation. The P5 framework (Geng, Liu, Fu, Ge, and Zhang, 2022) reformulated recommendation tasks as natural-language sequence-to-sequence problems, fine-tuning a T5 model on prompts that encode user history, item descriptions, and task instructions. Subsequent work (RecLM, OpenP5, generative retrieval models that emit item IDs as tokens) has shown that large language models can perform reasonable zero-shot and few-shot recommendation. A practical open question is whether LLM-based recommenders can match the latency and engagement of well-tuned two-tower retrievers at the scale of YouTube or TikTok. Most production systems in 2025 use LLMs as auxiliary signal generators (for example, embedding item text and user profiles) rather than as the primary recommender.
Foundation models for CF. Several groups are training general-purpose recommendation models on multi-domain interaction data, in the hope that they will transfer to new domains as cheaply as a fine-tuned LLM does to a new task. Work along these lines includes UniSRec, RecFormer, and several closed industrial models at Meta and Google.
Privacy, federation, and on-device CF. Tighter regulation on user tracking has driven interest in CF that can be trained without centralizing interaction logs. Federated CF (FedCF, FedNCF) trains models across user devices and aggregates only gradients. Differentially private CF adds calibrated noise to the gradients to bound the information any single user contributes. Apple has shipped on-device CF for several recommendation surfaces using its CoreML and federated learning infrastructure.
Classical CF still does most of the work at companies that ship recommendations to billions of users. The headline 2024-2026 papers are often about safety, fairness, efficiency, and integration with LLMs rather than fundamental new CF algorithms.