Collaborative filtering (CF) is a family of techniques used in recommender systems to predict a user's preferences by collecting and analyzing preference signals from many other users. The core assumption is the wisdom of the crowd: if two users have agreed on the quality of certain items in the past, they are likely to agree on other items in the future. Collaborative filtering ignores the actual content of items and works only from observed user-item interactions, which means a single CF algorithm can power recommendations for movies, songs, products, news articles, or research papers without any domain-specific feature engineering.
Collaborative filtering is one of the most widely deployed approaches in machine learning for personalization. It powers feeds and homepages at companies including Netflix, Amazon, Spotify, and YouTube, and it remains a foundational building block in modern industrial recommendation pipelines, even after the rise of large neural and graph-based models.
Imagine you and your friend both love the same cartoons. One day your friend watches a new cartoon and says it is great. Even though you have never seen it, you would probably like it too, because you and your friend have the same taste. Collaborative filtering works the same way. It looks at what lots of people watch or buy, finds the people who like the same things you like, and then suggests things those people enjoyed that you have not tried yet.
A collaborative filter takes as input a sparse user-item interaction matrix R of size m by n, where m is the number of users and n is the number of items. Each entry R[u, i] holds either an explicit rating (for example a 1 to 5 star score), an implicit signal (a click, watch, purchase, or play count), or is missing because user u has not interacted with item i. In a typical real-world catalog more than 99% of entries are missing.
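As a concrete illustration, here is a minimal sketch of how such a matrix is typically stored, using SciPy's compressed sparse row format on a tiny invented interaction log (the triplets below are placeholders, not real data):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical interaction log: (user_id, item_id, rating) triplets.
# In a real system these would come from a ratings table or event log.
users   = np.array([0, 0, 1, 2, 2, 2])
items   = np.array([0, 3, 1, 0, 2, 3])
ratings = np.array([5.0, 3.0, 4.0, 1.0, 4.0, 2.0])

m, n = 3, 4  # 3 users, 4 items
R = csr_matrix((ratings, (users, items)), shape=(m, n))

# Only 6 of the 12 entries are stored; the rest are "missing", not zero.
print(f"density: {R.nnz / (m * n):.2%}")   # -> density: 50.00%
print(R.toarray())
```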
The goal of CF is to fill in the missing entries, or equivalently to rank the unseen items for each user from most to least likely to be enjoyed. The algorithm does this by exploiting two regularities in R:

- Rows are redundant: users who have rated the same items similarly in the past tend to rate future items similarly.
- Columns are redundant: items that have been rated similarly by the same users tend to attract similar ratings from new users.
Neither regularity needs item descriptions or user profiles to work. CF is therefore considered a content-agnostic method, which is what distinguishes it from content-based filtering.
Collaborative filtering as a named idea is just over thirty years old, but its roots in social information filtering go back much further. The modern form began in the early 1990s and matured into a major academic and industrial subfield through three landmark events: the GroupLens project, Amazon's item-to-item algorithm, and the Netflix Prize.
The term collaborative filtering was coined by David Goldberg, David Nichols, Brian Oki, and Douglas Terry at Xerox PARC in their 1992 Communications of the ACM paper "Using collaborative filtering to weave an information tapestry." Tapestry was an experimental email and document-filtering system designed to handle the growing flood of internal Xerox mail and Usenet news. Users could annotate any document with reactions, for example marking a memo as "interesting" or replying to a thread. Other users wrote queries that referenced these annotations, such as "show me messages that Smith replied to." Tapestry required users to manually name the people whose opinions they trusted, which limited its scale, but it introduced the central idea that human reactions, not just document contents, can guide automated filtering.
In 1994, Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl, working at the University of Minnesota and MIT, introduced GroupLens at the ACM Conference on Computer Supported Cooperative Work. GroupLens applied automated CF to Usenet netnews, where the volume of articles already exceeded what any individual could read. News-reader clients displayed predicted scores for each article and let users rate articles after reading them. Servers called "Better Bit Bureaus" gathered ratings, computed user similarities (initially with Pearson correlation), and pushed predictions back to clients. GroupLens was the first system to demonstrate that CF could scale to open communities of strangers and the first to formalize the user-based neighborhood algorithm that became a standard textbook reference.
Almost simultaneously, Upendra Shardanand and Pattie Maes at the MIT Media Lab introduced Ringo (1995), a CF system for music recommendations. Ringo evolved into the commercial venture Firefly Network, which Microsoft acquired in 1998. By the late 1990s, CF was being explored by online retailers, news aggregators, and early social networking sites. Research labs at GroupLens, MIT Media Lab, and Compaq's Systems Research Center compared similarity functions, neighborhood sizes, and aggregation rules, producing the toolkit of memory-based methods still in use today.
Amazon was an early commercial adopter of collaborative filtering and is responsible for one of the most influential algorithmic changes in the field. In a 2003 IEEE Internet Computing paper, Greg Linden, Brent Smith, and Jeremy York described Amazon's switch from user-based CF to item-to-item collaborative filtering. Instead of finding users who looked like the active customer and recommending what those users had bought, Amazon precomputed an item-to-item similarity table offline, then at request time looked up similar items for everything in the customer's history. The change made recommendations much faster, scaled to tens of millions of customers, and produced higher-quality lists. In 2017 the IEEE Internet Computing editorial board named the paper the single best "test of time" article from its first twenty years of publication.
In October 2006, Netflix released a dataset of 100 million ratings and offered a one-million-dollar prize to anyone who could improve the RMSE of its Cinematch system by 10 percent. Over 40,000 teams from 186 countries entered. The competition popularized matrix factorization as the dominant CF approach: Simon Funk's blog post in December 2006 describing a stochastic gradient descent variant of singular value decomposition (now known as Funk SVD) reset the leaderboard and inspired hundreds of derivative models. Yehuda Koren's SVD++ added implicit feedback signals, and timeSVD++ added temporal dynamics.
The prize was awarded on 21 September 2009 to BellKor's Pragmatic Chaos, a merged team consisting of Yehuda Koren and Robert Bell of AT&T Labs, Chris Volinsky of AT&T Labs, Andreas Toscher and Michael Jahrer of Commendo Research, and Martin Piotte and Martin Chabbert of Pragmatic Theory. Their winning solution, with a test RMSE of 0.8567, was a blend of more than 100 individual models combining matrix factorization, restricted Boltzmann machines, and neighborhood methods. Netflix later said it never put the full ensemble into production because of engineering complexity, but the techniques and the open dataset transformed academic recommender systems research for a decade.
Collaborative filtering algorithms divide cleanly into two families. Memory-based methods keep the entire interaction matrix in memory and search it at prediction time. Model-based methods learn a compact predictive model offline and use it at prediction time. The same training data can be used by either family, and modern systems frequently combine both.
| Aspect | Memory-based CF | Model-based CF |
|---|---|---|
| Prediction approach | Searches raw user-item data at prediction time | Uses a trained model to score user-item pairs |
| Training phase | None (lazy learning) | Required (offline training) |
| Prediction speed | Slower for large catalogs | Fast at prediction time |
| Handling sparsity | Struggles with very sparse data | More robust to sparsity |
| Interpretability | High (neighbors are explainable) | Lower (latent factors are opaque) |
| Scalability | Limited by pairwise similarity computation | Scales to billions of interactions |
| Update cost | Trivial (just append data) | Requires retraining or incremental updates |
| Cold start handling | Poor without auxiliary signals | Poor without auxiliary signals |
| Typical algorithms | User-KNN, item-KNN | Matrix factorization, neural CF, autoencoders, GNNs |
Memory-based CF, also called neighborhood-based CF, predicts a user's preference for an item by aggregating preferences from a chosen neighborhood. The neighborhood is selected using a similarity measure over the rows or columns of the interaction matrix. There are two flavors, distinguished by whether the neighborhood is over users or over items.
User-based CF finds the set of users most similar to the active user, then predicts a rating for an unseen item as the similarity-weighted average of those neighbors' ratings. Formally, the predicted rating $\hat{r}_{u,i}$ for user $u$ and item $i$ is:

$$\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v \in N(u,i)} \mathrm{sim}(u,v)\,\bigl(r_{v,i} - \bar{r}_v\bigr)}{\sum_{v \in N(u,i)} \lvert \mathrm{sim}(u,v) \rvert}$$

where $N(u,i)$ is the neighborhood of users similar to $u$ who have rated $i$, and $\bar{r}_u$ is $u$'s average rating. Subtracting the mean corrects for the fact that some users habitually give higher or lower ratings than others.
User-based CF is intuitive and gives explanations like "users who rated these films the way you did also liked Item X," which is easy to display in a UI. Its weakness is computational: pairwise user similarity scales as O(m^2), and the user neighborhood drifts every time anyone rates a new item, so similarity needs frequent recomputation in production.
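A minimal sketch of this prediction rule on a toy dense matrix, with `np.nan` marking missing ratings; a production system would use sparse storage, precomputed neighborhoods, and a vectorized similarity computation:

```python
import numpy as np

def predict_user_based(R, u, i, k=2):
    """Mean-centered user-based CF prediction for user u, item i.

    R is a dense m x n array with np.nan for missing ratings.
    Illustrative only: recomputes means and similarities on every call.
    """
    mask = ~np.isnan(R)
    means = np.array([R[v][mask[v]].mean() for v in range(R.shape[0])])

    # Pearson similarity between u and every user who has rated item i,
    # computed over the items the two users have co-rated.
    sims = []
    for v in range(R.shape[0]):
        if v == u or np.isnan(R[v, i]):
            continue
        common = mask[u] & mask[v]
        if common.sum() < 2:
            continue
        a, b = R[u, common] - means[u], R[v, common] - means[v]
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom > 0:
            sims.append((a @ b / denom, v))

    neighbors = sorted(sims, reverse=True)[:k]   # top-k most similar raters of i
    if not neighbors:
        return means[u]                          # fall back to the user's mean
    num = sum(s * (R[v, i] - means[v]) for s, v in neighbors)
    den = sum(abs(s) for s, _ in neighbors)
    return means[u] + num / den

R = np.array([[5, 3, np.nan, 1],
              [4, np.nan, np.nan, 1],
              [1, 1, np.nan, 5],
              [1, np.nan, 5, 4]], dtype=float)
print(predict_user_based(R, u=0, i=2))
```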
Item-based CF reverses the geometry. Instead of looking for similar users, the system precomputes an item-item similarity table and predicts ratings using items already seen by the user as the neighborhood. The predicted rating becomes:
$$\hat{r}_{u,i} = \frac{\sum_{j \in N(i,u)} \mathrm{sim}(i,j)\, r_{u,j}}{\sum_{j \in N(i,u)} \lvert \mathrm{sim}(i,j) \rvert}$$

where $N(i,u)$ is the set of items rated by $u$ that are most similar to target item $i$. The advantages over user-based CF were quantified by Sarwar, Karypis, Konstan, and Riedl in their 2001 WWW paper. Item-item similarities change much less often than user-user similarities, since rating one new movie barely affects the similarity profile of a movie with millions of ratings. The similarity table can therefore be computed offline and reused, making the online prediction step extremely fast. Linden, Smith, and York's 2003 paper showed that item-based CF was the technology behind Amazon's iconic "Customers who bought this also bought" widget and could serve recommendations for tens of millions of users at sub-second latency.
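A sketch of the two phases on a toy matrix, assuming cosine similarity and treating 0 as "missing" (a common simplification, especially for implicit-style data):

```python
import numpy as np

# --- Offline: precompute the item-item cosine similarity table. ---
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 5, 4]], dtype=float)

norms = np.linalg.norm(R, axis=0, keepdims=True)
norms[norms == 0] = 1.0                       # avoid division by zero
S = (R / norms).T @ (R / norms)               # n x n cosine similarity
np.fill_diagonal(S, 0.0)                      # an item is not its own neighbor

# --- Online: score item i for user u from the items u has rated. ---
def predict_item_based(u, i, k=2):
    rated = np.nonzero(R[u])[0]               # items user u has interacted with
    top = rated[np.argsort(S[i, rated])[::-1][:k]]
    den = np.abs(S[i, top]).sum()
    return S[i, top] @ R[u, top] / den if den > 0 else 0.0

print(predict_item_based(u=1, i=2))
```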
Model-based CF learns a parametric predictive model that captures the regularities in the interaction matrix. At prediction time, the system applies the model rather than searching the raw data. Model-based methods are more robust to sparsity, generally faster at serving, and easier to combine with side information, but they require an offline training stage and the learned representations are typically less interpretable than nearest-neighbor lists.
The dominant family of model-based methods is matrix factorization, which factors the interaction matrix into a low-rank product of a user matrix and an item matrix. Other classical model-based approaches include clustering-based CF, Bayesian networks, and restricted Boltzmann machines (which were a key ingredient of the Netflix Prize winning blend). Modern model-based CF is dominated by neural network architectures, including neural collaborative filtering, autoencoders, two-tower models, graph neural networks, and transformer-based sequential models.
Both user-based and item-based CF depend on a similarity measure to quantify how alike two rating vectors are. The choice of metric affects recommendation quality significantly, especially under sparsity.
| Metric | Definition | Range | Common use |
|---|---|---|---|
| Cosine similarity | Cosine of the angle between two rating vectors, treated as vectors in an item-space (or user-space) | -1 to 1 | Default for item-based CF; works well with implicit feedback |
| Pearson correlation | Linear correlation between two users' ratings, with each rating centered on the user's mean | -1 to 1 | Default for user-based CF; corrects for user-specific rating biases |
| Adjusted cosine similarity | Cosine similarity computed after subtracting each user's mean rating from each rating | -1 to 1 | Item-based CF when raters use the rating scale differently |
| Jaccard index | Size of intersection over size of union of two users' interacted-item sets | 0 to 1 | Implicit feedback or binary interactions |
| Spearman rank correlation | Pearson correlation applied to rating ranks rather than raw scores | -1 to 1 | When rating distributions are skewed or non-linear |
| Euclidean distance | Straight-line distance between two rating vectors | 0 to infinity | General-purpose; rarely the best CF choice |
In practice, cosine and adjusted cosine are the workhorse metrics for item-based CF, while Pearson dominates user-based CF on explicit ratings. With implicit feedback, the Jaccard index and conditional probability ratios (such as Amazon's "customers who bought X also bought Y" formula) are common.
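For reference, here is a minimal sketch of three of these metrics on invented toy vectors; in practice the rating-based metrics are computed only over the items two users have co-rated:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def pearson(a, b):
    # Center each vector on its own mean, then take the cosine.
    return cosine(a - a.mean(), b - b.mean())

def jaccard(set_a, set_b):
    return len(set_a & set_b) / len(set_a | set_b)

u = np.array([5.0, 4.0, 1.0, 2.0])   # two users' ratings on the same 4 items
v = np.array([4.0, 5.0, 2.0, 1.0])
print(f"cosine : {cosine(u, v):.3f}")   # high: vectors point the same way
print(f"pearson: {pearson(u, v):.3f}")  # also high after mean-centering
print(f"jaccard: {jaccard({1, 2, 3}, {2, 3, 4}):.3f}")  # 2/4 = 0.5
```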
Matrix factorization is the most influential model-based approach in CF. The central idea is to decompose the sparse user-item rating matrix R into the product of two lower-dimensional dense matrices: a user matrix U of size m by k and an item matrix V of size n by k, where k (the number of latent factors) is much smaller than m or n. The predicted rating for user u and item i is the dot product:
$$\hat{r}_{u,i} = U_u \cdot V_i = \sum_{f=1}^{k} U_{u,f}\, V_{i,f}$$

where $U_u$ and $V_i$ are the factor rows for user $u$ and item $i$.
Fitting U and V to the observed ratings via regularized squared error reduces the recommendation problem to an optimization that, while non-convex jointly, is convex in either factor matrix when the other is held fixed. Matrix factorization is treated in detail in the dedicated matrix factorization article. The summary below covers the variants most relevant to CF.
Simon Funk's blog post during the Netflix Prize popularized a stochastic gradient descent procedure that updates user and item factors one observed rating at a time, ignoring the millions of missing entries entirely. Despite its name, Funk SVD is not a true singular value decomposition (SVD requires a complete matrix) but a regularized matrix factorization optimized with SGD. The method was simple, fast, and dramatically more accurate than the neighborhood baselines. Funk's post put him in third place on the leaderboard at the time, and his approach became the basis for nearly every later matrix factorization variant.
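A minimal sketch of the SGD training loop in this spirit, omitting the bias terms that most production variants add (the toy triples and hyperparameters are illustrative):

```python
import random
import numpy as np

def funk_svd(triples, m, n, k=20, lr=0.02, reg=0.02, epochs=200, seed=0):
    """SGD matrix factorization in the spirit of Funk's approach.

    Trains only on observed (user, item, rating) triples, ignoring
    missing entries entirely.
    """
    rng = np.random.default_rng(seed)
    U = rng.normal(0, 0.1, (m, k))
    V = rng.normal(0, 0.1, (n, k))
    shuffler = random.Random(seed)
    for _ in range(epochs):
        shuffler.shuffle(triples)
        for u, i, r in triples:
            err = r - U[u] @ V[i]            # error on this one rating
            u_old = U[u].copy()              # update both factors from old values
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * u_old - reg * V[i])
    return U, V

triples = [(0, 0, 5.0), (0, 3, 1.0), (1, 0, 4.0),
           (2, 1, 1.0), (2, 3, 5.0), (3, 2, 5.0)]
U, V = funk_svd(triples, m=4, n=4)
print(f"predicted r(0, 0): {U[0] @ V[0]:.2f}")   # converges toward 5
```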
Yehuda Koren's SVD++ extends Funk SVD by adding a second per-item factor that captures the influence of which items a user has rated, even if the rating value is unknown. This implicit signal often improves accuracy on top of the explicit signal alone. timeSVD++ goes a step further and lets user biases, item biases, and user factors drift over time, which captures effects such as users' tastes shifting and Netflix periodically re-calibrating its star scale. Both variants were core components of the BellKor and BellKor's Pragmatic Chaos solutions.
Non-negative matrix factorization (NMF) imposes the constraint that all entries of U and V are non-negative. The constraint produces parts-based representations that are easier to interpret, since a learned factor cannot subtract from another. NMF is well suited to non-negative data such as count matrices and is used in topic modeling and image decomposition as well as CF.
Probabilistic matrix factorization (Salakhutdinov and Mnih, 2007) gives matrix factorization a Bayesian formulation in which user and item factors have Gaussian priors and ratings are conditional Gaussians given the dot product. PMF generalizes well on sparse data and provides a principled way to add user-item side information through the priors. Bayesian PMF (BPMF), introduced the following year, removes the need to set hyperparameters by hand by sampling them with MCMC.
Alternating least squares (ALS) optimizes matrix factorization by alternately fixing one factor matrix and solving for the other. With one matrix held constant, the loss is convex in the other and reduces to a regularized least-squares problem with a closed-form solution per row. The procedure is highly parallel: every user row update is independent, and every item row update is independent, so ALS scales naturally on distributed engines such as Apache Spark MLlib.
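A sketch of the alternating closed-form updates on dense toy data; real implementations (such as Spark's) use sparse storage and distribute the independent row solves:

```python
import numpy as np

def als_explicit(R, mask, k=10, reg=0.1, iters=15):
    """Alternating least squares on an explicit-feedback matrix.

    R: m x n ratings (values irrelevant where mask == 0); mask: 1 where
    observed. Each row update is a small regularized least-squares solve.
    """
    m, n = R.shape
    rng = np.random.default_rng(0)
    U = rng.normal(0, 0.1, (m, k))
    V = rng.normal(0, 0.1, (n, k))
    I = reg * np.eye(k)
    for _ in range(iters):
        for u in range(m):                  # fix V, solve for each user row
            obs = mask[u].astype(bool)
            Vo = V[obs]
            U[u] = np.linalg.solve(Vo.T @ Vo + I, Vo.T @ R[u, obs])
        for i in range(n):                  # fix U, solve for each item row
            obs = mask[:, i].astype(bool)
            Uo = U[obs]
            V[i] = np.linalg.solve(Uo.T @ Uo + I, Uo.T @ R[obs, i])
    return U, V

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 5, 4]], dtype=float)
mask = (R > 0).astype(float)
U, V = als_explicit(R, mask)
print(np.round(U @ V.T, 1))   # reconstruction with missing entries filled in
```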
Hu, Koren, and Volinsky's 2008 ICDM paper extended ALS to implicit feedback. Their key insight is that for implicit data, every user-item pair is observed: an interaction means "some preference, some confidence" and a non-interaction means "some preference (probably negative), low confidence." The model treats the binary preference p and a confidence weight c separately, then minimizes the confidence-weighted squared error over the full m by n matrix using ALS with closed-form linear-algebra tricks that exploit the low rank structure. The algorithm became the default factorization for implicit data and is the basis for the open-source implicit Python library and the implicit mode of Spark ALS. The paper won the IEEE ICDM 10-Year Highest-Impact Paper Award in 2017.
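A sketch of the single-user closed-form update from the paper, using the standard linear confidence $c_{ui} = 1 + \alpha r_{ui}$ and the identity $Y^\top C^u Y = Y^\top Y + Y^\top (C^u - I) Y$, which makes the solve cheap because $(C^u - I)$ is nonzero only on interacted items (variable names here are illustrative):

```python
import numpy as np

def implicit_als_user_update(Y, r_u, alpha=40.0, reg=0.1):
    """One user-factor solve in the style of Hu, Koren & Volinsky (2008).

    Y: n x k item factors; r_u: this user's raw interaction counts.
    Real implementations precompute Y.T @ Y once per sweep, since it is
    shared across all users.
    """
    n, k = Y.shape
    p_u = (r_u > 0).astype(float)          # binary preference
    c_u = 1.0 + alpha * r_u                # confidence grows with count
    YtY = Y.T @ Y                          # user-independent term
    nz = np.nonzero(r_u)[0]                # only interacted items matter below
    A = YtY + Y[nz].T @ ((c_u[nz] - 1.0)[:, None] * Y[nz]) + reg * np.eye(k)
    b = Y[nz].T @ (c_u[nz] * p_u[nz])
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
Y = rng.normal(0, 0.1, (1000, 32))                     # 1000 items, 32 factors
r_u = np.zeros(1000); r_u[[3, 17, 256]] = [2, 1, 7]    # three interactions
print(implicit_als_user_update(Y, r_u).shape)          # (32,)
```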
Collaborative filtering systems differ in the type of feedback they consume.
Explicit feedback is direct, intentional input from users: star ratings, thumbs-up or thumbs-down, written reviews. It expresses preference clearly, but it is sparse: most users rate only a small fraction of items they interact with, and the act of rating itself selects for unusual reactions (very good or very bad).
Implicit feedback is inferred from observed behavior: clicks, watches, purchases, dwell time, plays, scroll depth, repeat visits. It is far more abundant than explicit feedback. Its weakness is interpretation: viewing a product page does not necessarily indicate a positive preference, and the absence of an interaction does not reliably indicate disinterest (the user may not have seen the item).
| Feedback type | Examples | Advantages | Disadvantages |
|---|---|---|---|
| Explicit | Star ratings, likes, written reviews, thumbs up or down | Clear preference signal; easy to model | Very sparse; requires user effort; selection bias |
| Implicit | Clicks, purchases, watch time, listen counts, dwell time, scroll depth | Abundant; no extra user effort; richer behavioral coverage | Noisy; non-interaction is ambiguous; biased by exposure |
Most large production recommenders today rely primarily on implicit feedback because of its abundance. Hu, Koren, and Volinsky's 2008 weighted-confidence ALS framework gave the field its first principled implicit-feedback factorization. Many systems still mix in explicit signals where available to disambiguate cases such as a long click followed by a quick close (negative implicit signal but a positive click).
The rise of deep learning after 2014 brought several neural architectures into the CF toolkit. The motivation was that classic matrix factorization assumes a linear interaction (a dot product) between user and item factors, while real preferences may depend non-linearly on combinations of factors and on side information that is hard to slot into a linear model.
Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua's 2017 paper at WWW introduced neural collaborative filtering (NCF). NCF replaces the dot product of matrix factorization with a feedforward neural network that learns an arbitrary user-item interaction function from data. The paper proposed three model variants: Generalized Matrix Factorization (GMF), which generalizes the standard inner product; Multi-Layer Perceptron (MLP), which feeds concatenated user and item embeddings into a stack of fully connected layers; and NeuMF, which fuses GMF and MLP and trains them jointly. NCF was evaluated on the MovieLens 1M and Pinterest datasets and significantly outperformed traditional MF on Hit Rate at 10 and NDCG at 10. NCF also normalized the practice of training CF models on implicit feedback with binary cross-entropy loss and negative sampling, which became standard.
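As an illustration, here is a sketch of the MLP branch of this idea in PyTorch; it is not the paper's exact NeuMF architecture (which additionally fuses a GMF branch), and the layer sizes and toy batch are placeholders:

```python
import torch
import torch.nn as nn

class MLPRecommender(nn.Module):
    """NCF-style MLP branch: learned embeddings fed through an MLP."""
    def __init__(self, n_users, n_items, dim=32):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, users, items):
        x = torch.cat([self.user_emb(users), self.item_emb(items)], dim=-1)
        return self.mlp(x).squeeze(-1)   # logit; sigmoid applied in the loss

model = MLPRecommender(n_users=1000, n_items=5000)
loss_fn = nn.BCEWithLogitsLoss()

# One training step on a toy batch: observed pairs are positives (label 1),
# randomly sampled unobserved pairs are negatives (label 0).
users = torch.tensor([0, 0, 7])
items = torch.tensor([12, 401, 33])
labels = torch.tensor([1.0, 0.0, 1.0])
loss = loss_fn(model(users, items), labels)
loss.backward()
print(float(loss))
```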
A 2020 SIGIR paper by Rendle, Krichene, Zhang, and Anderson re-examined NCF and showed that with careful tuning, a regularized matrix factorization baseline could match or beat NeuMF on the same benchmarks. The exchange highlighted how brittle published CF benchmarks can be and contributed to the community's increased focus on rigorous evaluation.
Autoencoder-based CF treats each user as a sparse vector in item-space and trains an autoencoder to reconstruct it. AutoRec (Sedhain, Menon, Sanner, and Xie, 2015) was the first such model. CDAE (Wu, DuBois, Zheng, and Ester, 2016) added denoising. Mult-VAE (Liang, Krishnan, Hoffman, and Jebara, 2018) used a variational autoencoder with a multinomial likelihood and showed strong performance on MovieLens-20M and Netflix data. Autoencoder CF tends to outperform traditional MF on top-N recommendation, particularly on implicit data, because the multinomial likelihood is a better match for click counts than squared error.
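A sketch of the AutoRec idea: reconstruct each user's row of the interaction matrix and take the loss only over observed entries (sizes and toy data are placeholders):

```python
import torch
import torch.nn as nn

class UserAutoRec(nn.Module):
    """AutoRec-style autoencoder over a user's item vector."""
    def __init__(self, n_items, hidden=256):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(n_items, hidden), nn.Sigmoid())
        self.decode = nn.Linear(hidden, n_items)

    def forward(self, r):
        return self.decode(self.encode(r))

n_items = 5000
model = UserAutoRec(n_items)
r = torch.zeros(1, n_items)
r[0, [3, 42, 77]] = torch.tensor([5.0, 3.0, 4.0])      # one user's ratings
mask = (r > 0).float()
pred = model(r)
loss = (((pred - r) * mask) ** 2).sum() / mask.sum()   # observed entries only
loss.backward()
```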
Large-scale industry recommenders often need to select a few thousand candidates from a corpus of hundreds of millions of items in milliseconds. The dominant architecture for this candidate-generation step is the two-tower model: a user tower encodes user features and recent history into a vector, an item tower encodes item features into a vector, and a dot product (or cosine similarity) between the two vectors scores each user-item pair. Item vectors are precomputed and indexed for approximate nearest neighbor retrieval. The user vector is computed at request time and the system retrieves the K nearest item vectors.
The seminal industrial paper is Covington, Adams, and Sargin's 2016 RecSys paper "Deep Neural Networks for YouTube Recommendations," which described a two-stage architecture (deep candidate generation followed by deep ranking) serving billions of users. Yi et al.'s 2019 RecSys paper "Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations" extended the two-tower retrieval approach to YouTube's tens of millions of videos and introduced an in-batch sampling correction that adjusts for the over-representation of popular items in random batches. The paper's streaming-data frequency estimation algorithm is now standard in YouTube's neural retrieval system.
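A sketch of the two-tower pattern with in-batch softmax negatives, in the spirit of these papers; the feature widths, layer sizes, and random inputs are placeholders:

```python
import torch
import torch.nn as nn

class Tower(nn.Module):
    """One tower: feature vector in, L2-normalized embedding out."""
    def __init__(self, in_dim, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

    def forward(self, x):
        return nn.functional.normalize(self.net(x), dim=-1)

user_tower = Tower(in_dim=40)   # hypothetical user feature width
item_tower = Tower(in_dim=80)   # hypothetical item feature width

# Training: in-batch softmax, where each item in the batch serves as a
# negative for every other user's positive pair.
users = torch.randn(256, 40)
items = torch.randn(256, 80)
logits = user_tower(users) @ item_tower(items).T   # 256 x 256 scores
labels = torch.arange(256)                          # diagonal = positives
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()

# Serving: item embeddings are precomputed into an ANN index; the user
# embedding is computed per request and the top-K items are retrieved.
```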
Graph neural networks treat the user-item interaction matrix as a bipartite graph and learn representations by propagating information along edges. The first widely deployed industrial example was PinSage at Pinterest (Ying, He, Chen, Eksombatchai, Hamilton, Leskovec, and Rosenberg, 2018), trained on a graph of 3 billion pins and boards with 18 billion edges. PinSage uses random walks to sample importance-weighted neighborhoods and produces embeddings that combine graph structure with pin features (text, image, and category metadata).
Research GNN-CF models include NGCF (Wang, He, Wang, Feng, and Chua, 2019), which propagates user and item embeddings through several graph convolution layers, and LightGCN (He, Deng, Wang, Li, Zhang, and Wang, 2020), which simplifies NGCF by removing the feature transformation and non-linear activation and keeping only the neighborhood aggregation step. Despite its simplicity, LightGCN often matches or beats more complex GNN architectures on standard CF benchmarks and has become a strong baseline in academic papers.
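A sketch of LightGCN's propagation rule on a toy binary matrix; training of the layer-0 embeddings (in the paper, with a BPR loss) is omitted:

```python
import numpy as np

def lightgcn_embeddings(R, dim=16, layers=3, seed=0):
    """LightGCN-style propagation on the user-item bipartite graph.

    R: binary m x n interaction matrix. Embeddings are propagated with
    the symmetrically normalized adjacency and averaged across layers;
    there are no feature transforms or non-linearities, which is exactly
    the simplification LightGCN makes.
    """
    m, n = R.shape
    A = np.block([[np.zeros((m, m)), R],
                  [R.T, np.zeros((n, n))]])        # bipartite adjacency
    d = A.sum(axis=1)
    d[d == 0] = 1.0
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A @ D_inv_sqrt             # D^-1/2 A D^-1/2

    rng = np.random.default_rng(seed)
    E = rng.normal(0, 0.1, (m + n, dim))            # layer-0 embeddings
    out = E.copy()
    for _ in range(layers):
        E = A_hat @ E                               # pure neighborhood aggregation
        out += E
    out /= layers + 1                               # mean over layers
    return out[:m], out[m:]                         # user, item embeddings

R = (np.random.default_rng(1).random((6, 8)) < 0.3).astype(float)
U, V = lightgcn_embeddings(R)
scores = U @ V.T                                    # rank items per user
```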
Sequential recommenders model the time-ordered sequence of items each user has interacted with and predict the next item. They overlap heavily with CF: the training signal still comes from observed user-item interactions, but the model learns transitions and conditional patterns rather than static user representations.
GRU4Rec (Hidasi, Karatzoglou, Baltrunas, and Tikk, 2016) used a gated recurrent network and was the first widely cited deep sequential recommender. SASRec (Kang and McAuley, 2018) replaced the RNN with a unidirectional transformer (similar to a small GPT) and shifted the field toward self-attention. BERT4Rec (Sun, Liu, Wu, Pei, Lin, Ou, and Jiang, 2019) used bidirectional self-attention with a masked-item Cloze objective, modeled on BERT, and became a strong sequential CF baseline. Industrial recommenders at TikTok, Meta, Pinterest, and YouTube use heavily customized variants of these transformer architectures, often with hundreds of millions of parameters.
Collaborative filtering and content-based filtering are the two classical paradigms for recommendation, and they have complementary strengths.
| Aspect | Collaborative filtering | Content-based filtering |
|---|---|---|
| Required input | User-item interaction matrix | Item attribute features (text, tags, images, metadata) |
| Required user data | Interaction history with multiple items | History plus item attribute preferences |
| New user (cold start) | Poor, no history to leverage | Possible if user provides preferences |
| New item (cold start) | Cannot recommend (no interactions yet) | Works as soon as features exist |
| Discovery quality | High (can find unexpected cross-genre items) | Limited to items similar to past consumption |
| Domain transfer | Works without domain-specific features | Requires per-domain feature engineering |
| Popularity bias | Strong | Weaker |
| Niche items | Underrepresented | Better represented |
| Filter bubble risk | Higher | Lower |
| Best for | Large catalogs with rich interaction data | Domains with rich item descriptions and limited interactions |
In practice, modern production systems are almost always hybrid: they combine CF with content features and often with knowledge graphs, sequential signals, and contextual features such as time and device. Hybridization mitigates the cold start problem of pure CF, the over-specialization of pure content-based filtering, and the popularity bias of either method used alone.
The cold start problem is the central operational headache of CF. It manifests in three forms.
User cold start. A new user has no interaction history, so neighborhood methods cannot find similar users and matrix factorization cannot learn the user's latent vector. Common fixes include onboarding flows that ask the user to rate or pick from a curated set of popular items, default models built from demographic signals, and content-based fallbacks until enough interactions accrue.
Item cold start. A new item has no interactions, so it cannot appear in any neighborhood and cannot be ranked by a CF model. Common fixes include hybrid models that incorporate item features (text, image embeddings, categorical metadata) into the item tower, an explicit "new items" injection bucket, and early exploration in candidate generation.
System cold start. A brand new platform has no interactions at all. It typically launches with non-personalized popularity baselines, editorial curation, or content-based recommendations until usage accrues enough data for CF to take over.
Research on cold start mitigation has produced a long line of techniques: dropout-based augmentation in autoencoders, meta-learning approaches that adapt quickly from a few interactions, contrastive pre-training on rich item features, and large-language-model-driven cold start that uses textual item descriptions to bootstrap embeddings.
Data sparsity. Real interaction matrices are typically 95 to 99.99 percent empty. Sparsity makes neighborhood selection unreliable and leaves matrix factorization underdetermined for users and items with few observations. Dimensionality reduction, regularization, and side information all help.
Scalability. With hundreds of millions of users and items, exhaustive pairwise computation is infeasible. Solutions include approximate nearest neighbor (ANN) search with locality-sensitive hashing (LSH) or Hierarchical Navigable Small World (HNSW) indexes, distributed training on Spark or parameter servers, sub-linear sampling tricks during training, and GPU acceleration.
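As an illustration of the ANN serving path, here is a sketch using FAISS's HNSW index; the vectors are random placeholders standing in for precomputed item embeddings:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 64
item_vecs = np.random.rand(100_000, d).astype("float32")
faiss.normalize_L2(item_vecs)             # cosine similarity via inner product

index = faiss.IndexHNSWFlat(d, 32)        # HNSW graph index, M=32 links/node
index.add(item_vecs)

user_vec = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(user_vec)
scores, ids = index.search(user_vec, 10)  # top-10 nearest items
print(ids[0])
```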
Popularity bias. CF inherently amplifies popular items because they appear in more user histories and accumulate more rating signal. The result is a feedback loop that under-recommends niche items. Mitigations include inverse propensity scoring, item-frequency penalties (the sampling correction in YouTube's two-tower model is one example), re-ranking for diversity, and explicit exploration mechanisms.
Filter bubbles and over-specialization. When CF works well it tends to produce items very similar to what a user already consumes, narrowing the feed over time. Diversity-aware re-ranking and serendipity metrics try to compensate.
Gray sheep. Users with idiosyncratic preferences that do not align with any neighborhood receive poor recommendations. There is no general solution; the standard mitigation is to fall back to content-based recommendations and to surface popular items.
Privacy. A CF model trained on detailed interaction logs encodes information about individual users. Membership inference attacks can determine whether a particular user was in the training set. Federated CF (for example FedNCF) keeps interaction data on user devices and shares only model gradients, and differentially private CF adds calibrated noise to the gradients to bound information leakage.
Adversarial robustness and shilling attacks. Because CF treats every user equally, attackers can inject fake user accounts that rate target items in coordinated ways to push them up or down in recommendations. The behavior is called a shilling attack or profile injection attack (the term "shill" comes from a fake bidder placed at an auction). A 2009 IEEE survey by Mobasher and Burke covered the early literature; later work has explored detection (clustering anomalies in the rating distribution), robust similarity functions, and adversarial training. Modern defenses combine behavioral fraud detection with model-level robustness.
Evaluation in CF splits along the same line as the algorithms themselves. Rating prediction is judged by accuracy of the predicted score, while top-N recommendation is judged by quality of the ranked list.
| Metric | Type | Description |
|---|---|---|
| RMSE (root mean squared error) | Rating accuracy | Square root of the mean squared error between predicted and actual ratings; was the official Netflix Prize metric |
| MAE (mean absolute error) | Rating accuracy | Mean absolute error between predicted and actual ratings; less sensitive to outliers than RMSE |
| Precision at K | Ranking quality | Fraction of the top K recommendations that are relevant |
| Recall at K | Ranking quality | Fraction of all relevant items that appear in the top K recommendations |
| Hit Rate at K | Ranking quality | Fraction of users for whom at least one held-out relevant item appears in the top K |
| NDCG at K | Ranking quality | Normalized discounted cumulative gain; rewards placing highly relevant items near the top, with a logarithmic position discount |
| MAP (mean average precision) | Ranking quality | Mean over users of average precision at every relevant rank |
| MRR (mean reciprocal rank) | Ranking quality | Mean of 1 over the rank of the first relevant item in each user's list |
| AUC (area under ROC) | Ranking quality | Probability that a random positive item is ranked above a random negative item |
| Coverage | Catalog quality | Fraction of the catalog that ever appears in any user's recommendations |
| Diversity | List quality | Average pairwise dissimilarity of items in a recommendation list |
| Serendipity | Beyond-accuracy | Frequency of unexpected but relevant recommendations |
| CTR (click-through rate) | Online | Fraction of impressions clicked |
| Watch time, conversion | Online | Aggregate engagement signals measured in live A/B tests |
RMSE was the headline metric of the Netflix Prize era and remains the default for rating-prediction CF, but the field has largely moved on to ranking metrics (NDCG at 10 and Recall at 20 are the most common in academic CF papers since 2017). Industrial systems treat offline metrics as a screening filter and use online A/B tests on CTR, watch time, conversion, and longer-term retention as the ground truth.
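For concreteness, a sketch of three of the ranking metrics for one user's recommendation list, assuming binary relevance (the toy list and held-out positives are invented):

```python
import numpy as np

def precision_recall_at_k(ranked, relevant, k):
    hits = len(set(ranked[:k]) & relevant)
    return hits / k, hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    # Logarithmic position discount; ranks are 0-indexed, hence r + 2.
    dcg = sum(1.0 / np.log2(r + 2)
              for r, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / np.log2(r + 2) for r in range(min(k, len(relevant))))
    return dcg / ideal

ranked = [10, 4, 7, 99, 3]                  # model's top-5 for one user
relevant = {4, 3, 50}                       # held-out positives
p, r = precision_recall_at_k(ranked, relevant, k=5)
print(f"P@5={p:.2f} R@5={r:.2f} NDCG@5={ndcg_at_k(ranked, relevant, 5):.2f}")
```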
Collaborative filtering research depends heavily on a small set of public benchmark datasets.
| Dataset | Domain | Size | Notes |
|---|---|---|---|
| MovieLens 100K | Movies | 100,000 ratings, 943 users, 1,682 movies | Long-standing teaching dataset from GroupLens |
| MovieLens 1M | Movies | 1 million ratings, 6,000 users, 4,000 movies | Released February 2003; very common benchmark |
| MovieLens 10M | Movies | 10 million ratings, 72,000 users, 10,000 movies | Released January 2009 |
| MovieLens 25M | Movies | 25 million ratings, 162,000 users, 62,000 movies | Includes a tag genome with 1,129 tags |
| Netflix Prize | Movies | 100 million ratings, 480,000 users, 17,770 movies | Released October 2006; withdrawn after a privacy lawsuit |
| Amazon Reviews | Products | 230 million reviews across 24 product categories | Maintained by Julian McAuley's lab at UCSD |
| Yelp Open Dataset | Local businesses | About 7 million reviews of 150,000 businesses | Refreshed periodically by Yelp |
| Last.fm | Music | Millions of play counts; the LFM-1B and LFM-2B variants are widely used | Implicit feedback in play counts |
| Goodreads | Books | About 230 million interactions and 2 million books | Released by McAuley et al. |
| Steam | Video games | Play time and reviews for millions of users | Implicit feedback friendly |
| H&M Personalized Fashion | Apparel | Two years of customer transactions | Released for a 2022 Kaggle competition |
| Twitch | Live-streaming | Watch sessions with timestamps | Sequential CF benchmark |
MovieLens remains the default sanity-check benchmark for new CF methods because the data is small enough to iterate on quickly, well-formed (one rating per user-movie pair), and decades old, which gives a deep comparison literature. Industrial systems are evaluated mostly on private logs, not these public sets.
Collaborative filtering is deployed across nearly every consumer-facing platform with a large content catalog.
| Platform | What CF powers | Notable details |
|---|---|---|
| Amazon | Product recommendations on the home page, product pages, and email campaigns | Pioneered item-to-item CF (Linden et al., 2003); recommendations reportedly drive a large share of revenue |
| Netflix | Title row personalization, similar-titles, autoplay | Funded the Netflix Prize; production stack moved from CineMatch through MF to deep learning |
| YouTube | Home feed, watch-next, search ranking | Two-tower neural retrieval (Yi et al., 2019) plus deep ranking (Covington et al., 2016) |
| Spotify | Discover Weekly, Daily Mix, Release Radar, similar artists | Combines CF on listening data with NLP on web text and audio embeddings |
| TikTok | The For You feed | Heavily sequential CF with transformer-style models on engagement signals |
| Related Pins, Home feed | PinSage GNN (Ying et al., 2018) at the candidate-generation stage | |
| Instagram and Facebook | Reels, Explore, suggested posts and accounts | Two-tower retrieval and large neural rankers; Meta's open-source DLRM (Naumov et al., 2019) was an early industrial deep CF model |
| People You May Know, job recommendations, feed ranking | CF blended with employment graph features | |
| Twitter / X | Recommended Tweets and accounts | Two-tower retrieval with graph signals; partial open source via the heavy-ranker repo |
| Apple Music and Pandora | Personalized stations and recommendations | Pandora started from content-based (Music Genome Project) and added CF later |
| Goodreads | Book recommendations | CF on rating and shelf data |
| Steam | Game recommendations | CF on play time and purchase history |
| Google News | Article recommendations | CF on click and read time across the user base |
In each case CF is one component of a larger system that includes business rules, freshness signals, ad targeting, and editorial curation. The actual algorithms behind these production systems evolve constantly and are usually only partially disclosed in research papers.
A mature ecosystem of open-source libraries implements the algorithms above.
| Library | Language | Focus |
|---|---|---|
| Surprise | Python | Classic CF algorithms (KNN, SVD, NMF, SlopeOne); scikit-learn-inspired API; great for teaching and small-scale experimentation |
| implicit | Python | Optimized C and Cython implementations of ALS, BPR, and LMF for implicit-feedback CF |
| LightFM | Python | Hybrid CF with WARP and BPR ranking losses; mixes item features with collaborative signal |
| Spark MLlib ALS | Scala / Java / Python | Distributed ALS for explicit and implicit feedback; production workhorse on big data clusters |
| TensorFlow Recommenders (TFRS) | Python | Modern two-tower retrieval, ranking, and full deep-CF stack on TensorFlow |
| TorchRec | Python | Meta's distributed embedding library for very large CF models on PyTorch |
| RecBole | Python | Comprehensive benchmark suite covering 80+ recent CF and sequential models |
| Microsoft Recommenders | Python | End-to-end examples and best-practice notebooks for CF, hybrid, and ranking tasks |
| Cornac | Python | Multimodal CF library that mixes text, images, and graphs with classic CF |
| RecPack | Python | Modular framework for evaluating top-N CF |
| Vowpal Wabbit | C++ / Python | Online learning of large-scale CF with quadratic features and contextual bandits |
| FAISS, ScaNN, HNSWlib | C++ / Python | Approximate nearest neighbor libraries used to serve two-tower CF embeddings |
For research, RecBole and Microsoft Recommenders are the most common starting points. For production at small to mid scale, the implicit library and Spark MLlib ALS are widely used. For large-scale neural CF, TFRS and TorchRec are the dominant frameworks, paired with FAISS or ScaNN for serving.
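As a taste of the small-scale end of this ecosystem, here is a sketch using Surprise (installed as scikit-surprise) to cross-validate an SVD model on MovieLens 100K; the hyperparameters are illustrative, and the library offers to download the dataset on first use:

```python
from surprise import SVD, Dataset
from surprise.model_selection import cross_validate

data = Dataset.load_builtin("ml-100k")               # GroupLens teaching dataset
algo = SVD(n_factors=100, n_epochs=20, lr_all=0.005, reg_all=0.02)
cross_validate(algo, data, measures=["RMSE", "MAE"], cv=5, verbose=True)
```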
Collaborative filtering is no longer a standalone subfield in the way it was in the Netflix Prize era. Three big trends define the current landscape.
LLM-based and generative recommendation. The P5 framework (Geng, Liu, Fu, Ge, and Zhang, 2022) reformulated recommendation tasks as natural-language sequence-to-sequence problems, fine-tuning a T5 model on prompts that encode user history, item descriptions, and task instructions. Subsequent work (RecLM, OpenP5, generative retrieval models that emit item IDs as tokens) has shown that large language models can perform reasonable zero-shot and few-shot recommendation. A practical open question is whether LLM-based recommenders can match the latency and engagement of well-tuned two-tower retrievers at the scale of YouTube or TikTok. Most production systems in 2025 use LLMs as auxiliary signal generators (for example, embedding item text and user profiles) rather than as the primary recommender.
Foundation models for CF. Several groups are training general-purpose recommendation models on multi-domain interaction data, in the hope that they will transfer to new domains as cheaply as a fine-tuned LLM does to a new task. Work along these lines includes UniSRec, RecFormer, and several closed industrial models at Meta and Google.
Privacy, federation, and on-device CF. Tighter regulation on user tracking has driven interest in CF that can be trained without centralizing interaction logs. Federated CF (FedCF, FedNCF) trains models across user devices and aggregates only gradients. Differentially private CF adds calibrated noise to the gradients to bound the information any single user contributes. Apple has shipped on-device CF for several recommendation surfaces using its CoreML and federated learning infrastructure.
Classical CF still does most of the work at companies that ship recommendations to billions of users. The headline 2024-2026 papers are often about safety, fairness, efficiency, and integration with LLMs rather than fundamental new CF algorithms.