Weighted Alternating Least Squares (WALS), also called implicit Alternating Least Squares (iALS) or weighted regularized matrix factorization (WRMF), is a matrix factorization algorithm for collaborative filtering on implicit feedback data such as clicks, views, purchases, and play counts. It was introduced by Yifan Hu, Yehuda Koren, and Chris Volinsky at AT&T Labs in their 2008 paper "Collaborative Filtering for Implicit Feedback Datasets," presented at the IEEE International Conference on Data Mining (ICDM 2008). The same paper later won the IEEE ICDM Ten Year Highest Impact Paper Award in 2017, recognizing its central role in industrial recommendation systems.
WALS extends standard Alternating Least Squares by attaching a per-entry confidence weight c_ui to every user-item pair, then minimizing a weighted regularized squared error over the entire user-item matrix rather than only over observed ratings. Because the loss is a sum of weighted squared errors with a quadratic regularization term, the optimal user vector for any given item factor matrix (and vice versa) has a closed-form least squares solution. The algorithm alternates between solving for user vectors with item vectors fixed and solving for item vectors with user vectors fixed until convergence. A key algebraic trick in the original paper, sometimes called the "iALS efficiency reformulation," reduces the per-iteration cost from quadratic in the number of users and items to linear in the number of non-zero observations plus a single Gram-matrix update per iteration. This makes WALS practical on industrial data with hundreds of millions of users and tens of millions of items.
explain like I'm 5 (ELI5)
Imagine an ice cream shop that wants to know which flavors each customer would love. Some customers fill out star ratings ("I give chocolate 5 stars"). That is easy. But most customers do not bother with stars. They just walk in, point at a flavor, and walk out. The shop sees what they ordered, not what they thought of it. That is implicit feedback.
WALS is a way for the shop to guess flavors for everyone, even when nobody filled out star ratings. It treats every flavor and every customer as a small list of secret tastes, like "likes nutty," "likes fruity," "likes creamy." The algorithm makes those lists up at random, then keeps adjusting them so that ordered flavors get high scores and never-ordered flavors get low scores. To make sure the never-ordered ones do not drown out the few flavors a customer actually picked, the algorithm gives ordered flavors a much louder voice (a higher confidence weight). It flips back and forth: first it adjusts customer lists with flavor lists held still, then flavor lists with customer lists held still, and so on, until the guesses stop changing.
The finished lists let the shop predict a score for every customer-flavor pair, even ones nobody has ever tried. The highest scoring flavors for each customer become recommendations.
history
Before 2008, most public collaborative filtering research focused on explicit ratings, especially the five-star scale used by Netflix. The Netflix Prize (2006-2009) trained an entire generation of researchers on dense rating matrices with about 100 million observed ratings on a scale of 1 to 5. Algorithms such as biased SVD, SVD++, and Funk SVD optimized squared error over only the observed entries.
Industrial recommender systems faced a different problem. Television set-top boxes, music streaming services, web stores, and video sites collected huge volumes of behavioral data (channel tunes, song plays, purchases, page views) but very few star ratings. The available signal was implicit, noisy, and asymmetric: an interaction probably indicated some interest, but the absence of an interaction could mean dislike, ignorance, or simply that the user had never been exposed to the item. Treating non-interactions as zero ratings biased the model toward the most popular items, while ignoring them entirely (as in Funk SVD) discarded the bulk of the available signal.
Yifan Hu, Yehuda Koren, and Chris Volinsky proposed a clean solution in 2008. Working with set-top box television viewing data at AT&T Labs Research, they reformulated matrix factorization for implicit feedback in three steps:
- Replace the raw observation
r_ui with a binary preference variable p_ui (1 if any interaction occurred, 0 otherwise).
- Attach a confidence weight
c_ui that grows with the strength of the observed signal (typical form c_ui = 1 + alpha * r_ui).
- Sum the weighted squared error over every user-item pair, then optimize with alternating least squares.
Because every pair contributes to the loss, the resulting factor vectors balance positive and negative signals naturally. The algebraic reformulation in the paper (often abbreviated as the "add-then-subtract" trick) keeps the per-iteration cost manageable: instead of iterating over m * n user-item pairs, the algorithm iterates over the nnz(R) observed pairs plus a small fixed number of operations involving an f x f Gram matrix, where f is the latent dimension.
The paper appeared at IEEE ICDM 2008 in Pisa, Italy, won the conference's Best Paper award that year, and went on to receive the ICDM 10-Year Highest-Impact Paper Award in 2017. The technique became known by several names. Hu, Koren, and Volinsky called it simply "the implicit feedback model." The Mahout project labelled it "WRMF" (weighted regularized matrix factorization). The Apache Spark MLlib team and TensorFlow's contributors called it "implicit ALS" or "WALS" (weighted ALS). All four names refer to the same family of algorithms.
Let R be a sparse m x n interaction matrix where m is the number of users, n is the number of items, and the entry r_ui records the strength of user u's interaction with item i (a play count, a click count, a watch fraction, or simply 1 for any interaction). WALS introduces two derived quantities:
- A binary preference
p_ui = 1 if r_ui > 0, else 0. The preference encodes whether any interaction occurred.
- A confidence weight
c_ui that grows with r_ui. The original paper uses two functional forms:
- Linear:
c_ui = 1 + alpha * r_ui
- Logarithmic:
c_ui = 1 + alpha * log(1 + r_ui / epsilon)
A value alpha = 40 is the well-known starting point from the paper, and epsilon is a small constant (1e-8 is common for the logarithmic form). Even a single observation gives a confidence above the baseline of 1 that all pairs share, so observed entries always have a stronger pull than unobserved ones.
objective function
WALS finds an m x f user factor matrix X and an n x f item factor matrix Y that minimize:
L(X, Y) = sum_{u, i} c_ui (p_ui - x_u^T y_i)^2 + lambda (sum_u ||x_u||^2 + sum_i ||y_i||^2)
The outer sum runs over all m * n user-item pairs, not just observed ones. Each pair contributes a weighted squared error between the binary preference p_ui and the dot product x_u^T y_i, and a quadratic regularization term keeps the magnitudes of the factor vectors in check. The hyperparameter lambda controls regularization strength and is typically chosen by cross-validation.
alternating updates
With the item matrix Y held fixed, the loss is a quadratic in each user vector x_u independently. Setting the gradient to zero gives the closed-form solution:
x_u = (Y^T C^u Y + lambda I)^{-1} Y^T C^u p(u)
where C^u is the n x n diagonal matrix with entries c_ui for the row of R belonging to user u, and p(u) is the column vector of preferences for that user. Symmetrically, with the user matrix X held fixed, each item vector has the closed-form solution:
y_i = (X^T C^i X + lambda I)^{-1} X^T C^i p(i)
WALS iterates between these two updates until convergence. Each step is provably non-increasing in the loss, which is why the algorithm is so stable in practice and requires almost no learning-rate tuning.
the efficiency trick
A naive implementation of Y^T C^u Y requires O(n * f^2) operations per user, leading to O(m * n * f^2) per outer iteration. That is intractable at industrial scale. The Hu, Koren, and Volinsky paper observes that:
Y^T C^u Y = Y^T Y + Y^T (C^u - I) Y
The first term, Y^T Y, does not depend on u and can be computed once per outer iteration in O(n * f^2) time. The second term, Y^T (C^u - I) Y, only involves rows of Y for items where c_ui > 1, which corresponds exactly to the items user u interacted with. For typical sparse data this is a small subset. The same decomposition applies to Y^T C^u p(u). The net effect is that each user update costs O(n_u * f^2 + f^3) operations, where n_u is the number of items that user u interacted with. Summed over all users, the per-iteration cost becomes O(nnz(R) * f^2 + (m + n) * f^3), which scales linearly with the number of observations.
This decomposition is what makes WALS feasible on real corpora. Without it, the same model would require billions of times more compute on a service with hundreds of millions of users.
prediction
After training, the predicted preference of user u for item i is the dot product:
r_hat_ui = x_u^T y_i
For recommendation, items are ranked by their predicted score for the user, optionally restricted to items the user has not yet interacted with. The dot-product structure also enables fast retrieval: candidate items can be located using approximate nearest neighbour search on the item factor matrix, with libraries such as FAISS or ScaNN providing sub-millisecond top-K retrieval over millions of items.
implicit vs explicit feedback
WALS exists because implicit feedback differs structurally from explicit ratings. The contrast drives the algorithm's design choices.
| Aspect | Explicit feedback | Implicit feedback |
|---|
| Example signals | 1-5 star ratings, thumbs up/down | Clicks, page views, purchases, plays, watch time |
| User intent | Direct preference statement | Behavioural proxy for preference |
| Coverage | Sparse; few users rate explicitly | Plentiful; every interaction is logged |
| Noise level | Lower (users opted in to rate) | Higher (a click is not always interest) |
| Missing data interpretation | Unknown rating | Possibly negative, possibly unobserved |
| Negative examples | Low star ratings | None directly available |
| Range | Bounded (e.g., 1 to 5) | Unbounded counts |
| Loss target | Predict the rating | Predict the preference, weighted by confidence |
| Sum over | Observed entries only | All user-item pairs (with weights) |
| Standard algorithm | Funk SVD, biased SVD, SVD++ | WALS / iALS, BPR-MF |
| Canonical paper | Koren, Bell, Volinsky 2009 (IEEE Computer) | Hu, Koren, Volinsky 2008 (ICDM) |
The choice between explicit and implicit feedback is not binary in practice. Many production systems blend both. SVD++ and timeSVD++, for example, augment explicit rating models with implicit "the user looked at this" signals. WALS sits firmly on the implicit side and is the workhorse for systems where explicit ratings are rare or absent.
the WALS algorithm step by step
The pseudocode below summarises the canonical WALS algorithm.
- Inputs. A sparse
m x n interaction matrix R with values r_ui >= 0, latent dimension f, regularization lambda, confidence weight scale alpha, number of iterations T.
- Build derived quantities. For each non-zero
r_ui, compute the preference p_ui = 1 and the confidence c_ui = 1 + alpha * r_ui. All other pairs have p_ui = 0 and c_ui = 1.
- Initialise
X and Y with small random values (commonly Gaussian with standard deviation 1/sqrt(f)).
- For
t = 1, 2, ..., T do:
a. Compute the global Gram matrix G_y = Y^T Y once.
b. For each user u, compute A_u = G_y + Y_u^T diag(c_ui - 1) Y_u + lambda I and b_u = Y_u^T diag(c_ui) p(u), where Y_u contains only the rows of Y for items user u interacted with. Solve A_u x_u = b_u to update x_u.
c. Compute the global Gram matrix G_x = X^T X once.
d. For each item i, perform the symmetric update for y_i.
- Stop when the loss stops improving by more than a tolerance, or after
T iterations.
- Predict
r_hat_ui = x_u^T y_i. Rank items per user for recommendations.
A few practical refinements appear in most production-grade implementations:
- Item biases are sometimes added by extending each user vector with a constant
1 and each item vector with a free bias coordinate.
- Frequency-aware regularization scales
lambda by the number of interactions the user or item has, as in Rendle, Krichene, Zhang, and Anderson's 2022 "Revisiting iALS" paper, which showed that careful regularization tuning lets iALS match deep collaborative filtering baselines.
- Conjugate gradient can replace the direct linear solve when
f is large, trading a slightly higher per-iteration cost for fewer outer iterations and lower memory.
- Negative sampling is sometimes added on top of the dense unobserved loss to focus the model on "hard negatives" rather than treating all unobserved items equally.
confidence weighting strategies
The confidence function c_ui is the lever that distinguishes WALS from naive matrix factorization on a 0/1 matrix. Different application domains favour different functional forms.
| Strategy | Formula | When to use |
|---|
| Constant | c_ui = 1 + alpha for all observed | Pure implicit binary signal (visited / did not visit) |
| Linear | c_ui = 1 + alpha * r_ui | Counts grow modestly; the original Hu/Koren/Volinsky default |
| Logarithmic | c_ui = 1 + alpha * log(1 + r_ui / epsilon) | Heavy-tailed counts; compresses the dynamic range |
| Square root | c_ui = 1 + alpha * sqrt(r_ui) | Balance between linear and log |
| Exposure-aware | c_ui depends on whether the item was actually shown | When impression logs are available |
| Time-decayed | c_ui = 1 + alpha * sum_t r_uti * decay(t) | Streaming services where recency matters |
A larger alpha puts more weight on observed pairs and effectively makes the algorithm fit positive signals harder. A smaller alpha makes the algorithm closer to a uniform-weight low-rank approximation of the binary preference matrix. In practice alpha is tuned by cross-validation on a held-out set.
implementations
WALS is widely implemented across the open source ecosystem. The table below summarises the most-used libraries and frameworks.
| Library / framework | Language | API / class | Notes |
|---|
| Apache Spark MLlib | Scala / Python / Java | pyspark.ml.recommendation.ALS with implicitPrefs=True | Distributed implementation; partitions users and items across cluster |
| implicit | Python (Cython + CUDA) | implicit.als.AlternatingLeastSquares | Single-machine; CPU and GPU; the de facto standard for medium-scale iALS |
| TensorFlow contrib (deprecated) | Python | tf.contrib.factorization.WALSMatrixFactorization | Native TensorFlow implementation; used at Google before Keras-era recommenders |
| TensorFlow Recommenders | Python | Two-tower retrieval | Modern Google recommendations stack; dot-product MF as a special case |
| Apache Mahout | Java | ParallelALSFactorizationJob ("WRMF" mode) | One of the earliest open implementations; Hadoop-based |
| Surprise | Python | (No native iALS; supports SVD, NMF, KNN-based) | Mostly explicit-feedback algorithms |
| LensKit | Python | lenskit.algorithms.als.ImplicitMF | Research-oriented; reproducible recommender benchmarks |
| LightFM | Python | (Hybrid; uses WARP / BPR loss rather than WALS) | Often compared against iALS as a baseline |
| Microsoft Recommenders | Python | Wraps Spark's ALS for implicit feedback | Includes notebooks reproducing benchmark results |
| QMF | C++ | Quora's matrix factorization library | Fast multithreaded WRMF |
Benfred Frederickson's implicit library deserves special mention. Its CUDA kernels can train a 100-factor iALS model on the Last.fm 1B dataset in minutes on a single GPU, an order of magnitude faster than the Spark implementation on a comparable cluster. The library also exposes BPR and Logistic Matrix Factorization, allowing direct comparison between WALS and ranking-based alternatives.
use cases
WALS has been applied across virtually every domain where implicit feedback dominates the available signal.
Music streaming services were early adopters. Spotify documented in a 2014 blog series by Erik Bernhardsson that its core collaborative filtering pipeline was built on iALS over user listening histories, with item embeddings then indexed by Annoy for approximate nearest neighbour search. Bernhardsson's open-source Annoy library was originally developed at Spotify partly to serve those WALS embeddings for retrieval. Even after Spotify moved much of its retrieval to two-tower neural models in the late 2010s, iALS remained a strong production baseline and an honest benchmark for new architectures.
video recommendation
The original Hu, Koren, and Volinsky paper used TV viewing data from set-top boxes: implicit signals from a US cable provider with millions of subscribers and tens of thousands of channels and shows. YouTube has published several recommender architectures over the years, including a 2016 paper ("Deep Neural Networks for YouTube Recommendations") that explicitly contrasts its deep retrieval model against matrix factorization baselines of the WALS family.
e-commerce
Large online stores use WALS over implicit purchase, view, and add-to-cart signals. Etsy described its use of an iALS-style model in a 2014 engineering blog post for personalised recommendations on its homepage, and Quora open-sourced its QMF library specifically to scale WRMF to its question-answering platform. The technique is also a standard in the AWS Personalize service's repertoire of off-the-shelf algorithms for implicit feedback.
advertising and content ranking
Click-through rate prediction systems often start with a WALS-style embedding learned over historical clicks, then enrich it with side features through models such as factorization machines, DLRM, or deep & cross networks. Even when downstream rankers are deep, the candidate generation step is frequently a dot-product retrieval over WALS-trained embeddings.
personalised search
Query-document interaction matrices can be factorized with WALS to produce query and document embeddings used as additional signals in learning-to-rank pipelines. Microsoft, Bing, and Yahoo have all published papers using iALS-style techniques as feature generators or candidate retrievers in personalised search.
scientific applications
Beyond classic recommendations, WALS appears in several scientific domains. Genomics studies use it to recover patient-mutation associations from sparse count matrices. Network analysis applies WALS to citation and collaboration graphs. Educational technology platforms use it to recommend exercises by treating student-question interactions as implicit feedback.
comparison with other matrix factorization methods
WALS is one of several matrix factorization algorithms. The choice depends on the data type, the loss function, and the engineering constraints.
| Method | Loss | Feedback | Optimization | Strengths | Weaknesses |
|---|
| Truncated SVD | Frobenius reconstruction | Fully observed dense matrix | Lanczos / randomized SVD | Optimal under squared error; closed form | Cannot handle missing data natively; treats zeros as observations |
| Funk SVD | Squared error on observed | Explicit | SGD | Simple; fast on sparse explicit data | Ignores unobserved pairs; weak on implicit feedback |
| Biased SVD | Squared error with biases | Explicit | SGD or ALS | Captures user/item bias; very strong on Netflix-style data | Same weakness on implicit feedback |
| SVD++ | Squared error with implicit term | Mixed | SGD | Combines explicit ratings with implicit signals | More expensive than basic SVD |
| NMF | KL or Frobenius | Counts (non-negative) | Multiplicative updates / ANLS | Interpretable parts; non-negative factors | No native confidence weighting for implicit |
| WALS / iALS | Weighted squared error over all pairs | Implicit | Alternating closed-form least squares | Industrial-scale implicit feedback; embeds well for ANN | Trains a global model; not personalised online |
| BPR-MF | Pairwise ranking (logistic) | Implicit | SGD over (user, pos, neg) triples | Optimises ranking directly; strong top-K performance | Triplet sampling can be slow; less stable than WALS |
| Logistic Matrix Factorization | Binary cross-entropy weighted | Implicit | SGD or ALS variants | Probabilistic interpretation; calibrated outputs | Often comparable to iALS but more hyperparameters |
| Probabilistic Matrix Factorization (PMF) | Gaussian likelihood + Gaussian prior | Explicit | MAP via SGD | Bayesian interpretation of regularization | Not designed for implicit feedback |
| Bayesian PMF | Same as PMF with hyperpriors | Explicit | MCMC | Automatic hyperparameter selection | Higher computational cost |
| Factorization Machines | Squared / logistic with side features | Either | SGD / ALS / MCMC | Handles arbitrary features | Loses some scalability benefits of pure MF |
| CCD++ | Squared error rank-one updates | Either | Cyclic coordinate descent | Very fast on multi-core CPUs | Less common in modern toolchains |
| Deep Matrix Factorization | Reconstruction with neural transforms | Either | SGD | Captures non-linearities | Often beaten by tuned iALS in benchmarks |
WALS vs SVD
Classical truncated SVD operates on a fully observed dense matrix and assumes uniform error variance. Applying it to an implicit feedback matrix means treating unobserved pairs as hard zeros, which biases the decomposition toward popular items. WALS keeps the dense formulation but downweights unobserved pairs by setting their confidence to 1 while observed pairs get confidence 1 + alpha * r_ui. This single change converts SVD-style decomposition into a method that respects the asymmetry of implicit data.
WALS vs Funk SVD
Funk SVD only computes loss over observed entries, which works well for explicit ratings but discards a key signal in implicit data: the absence of an interaction. WALS includes every pair in the loss with a small but non-zero weight, allowing the model to push apart observed and unobserved items in latent space.
WALS vs BPR
Both WALS and Bayesian Personalized Ranking (BPR) target implicit feedback, but they optimize different objectives. WALS minimizes weighted squared error over all pairs; BPR minimizes a pairwise ranking loss between observed and unobserved items. Empirically, BPR can produce sharper top-K rankings on dense datasets, while WALS is more stable, easier to parallelise, and converges in fewer iterations. The 2022 "Revisiting the Performance of iALS" paper by Rendle, Krichene, Zhang, and Anderson showed that with careful tuning of regularization, sampling, and frequency weighting, iALS matches or exceeds BPR on standard benchmarks (MovieLens, Pinterest, Million Songs).
WALS vs NMF
Non-negative matrix factorization constrains all factor entries to be non-negative, producing parts-based representations that are often interpretable. WALS has no sign constraint and is optimised for prediction quality on implicit data rather than interpretability. NMF can be adapted to implicit feedback (see Hu, Koren, and Volinsky's discussion of weighted NMF), but the WALS formulation is usually faster and easier to implement.
modern alternatives
Deep learning and self-supervised methods have produced several alternatives to WALS for implicit feedback recommendation.
| Method | Year | Architecture | Pros vs WALS | Cons vs WALS |
|---|
| Neural Collaborative Filtering (NCF) | 2017 | Two-branch MLP combining GMF and MLP | Captures non-linearities; flexible inputs | Slower; Rendle 2020 showed properly tuned WALS often matches it |
| Variational Autoencoder for CF (Mult-VAE) | 2018 | Encoder-decoder with multinomial likelihood | Strong on dense datasets; principled probabilistic loss | Larger memory footprint; harder retrieval |
| Two-tower retrieval | ~2016+ | Separate user and item encoders | Native support for side features | Higher engineering complexity |
| Deep Learning Recommendation Model (DLRM) | 2019 (Meta) | Embedding tables + MLP + feature interactions | Native handling of categorical and numeric features | Designed for ranking, not retrieval |
| Wide & Deep | 2016 (Google) | Linear wide branch + deep MLP branch | Generalisation plus memorisation | Heavier than WALS; needs feature engineering |
| SASRec | 2018 | Self-attention over user history | Captures sequential preferences | Requires user histories; offline retraining |
| BERT4Rec | 2019 | Bidirectional transformer | Stronger sequence modelling | Even higher compute cost |
| GRU4Rec | 2016 | Gated recurrent network over sessions | Good for session-based recommendation | Less of a general retrieval model |
| EASE / EASE^R | 2019 | Item-item linear autoencoder | Closed-form; sometimes beats neural baselines | O(n^2) memory in items |
| LightGCN | 2020 | Linearised graph convolution on user-item graph | Strong on benchmarks; conceptually simple | Graph propagation cost |
A recurring theme in this list is that carefully tuned WALS remains a competitive baseline. Steffen Rendle, Walid Krichene, Li Zhang, and John Anderson's 2020 RecSys paper "Neural Collaborative Filtering vs. Matrix Factorization Revisited" reproduced the original NCF experiments and showed that a properly hyperparameter-tuned dot-product matrix factorization outperforms the MLP-based learned similarity. Their 2022 follow-up, "Revisiting the Performance of iALS," showed the same for WALS specifically: with appropriate frequency-aware regularization, item bias, and the confidence functional form chosen by cross-validation, iALS matches or beats Mult-VAE, EASE, and LightGCN on the MovieLens-20M, MSD, and Yelp benchmarks.
In industry the picture is similar. Most large recommender pipelines combine WALS-style retrieval with a deep ranker. WALS-trained item embeddings feed an approximate nearest neighbour index (FAISS, ScaNN, Annoy) to generate a few hundred candidates per user in real time; a downstream neural model with side features then re-ranks those candidates. The role of WALS has narrowed from end-to-end recommender to fast candidate generator, but it remains an essential ingredient.
hyperparameters and tuning
The main hyperparameters of WALS are:
| Hyperparameter | Symbol | Typical values | Notes |
|---|
| Latent dimension | f | 32 to 512 | Higher gives more capacity but more memory and compute |
| Regularization | lambda | 1e-3 to 1e1 | Tune by cross-validation; frequency-aware variant scales by user/item interaction count |
| Confidence weight scale | alpha | 1 to 100 | Larger pulls observed pairs harder; the original paper used 40 |
| Confidence functional form | linear or log | Pick by experiment | Log compresses heavy-tailed counts |
| Number of iterations | T | 10 to 30 | Loss usually plateaus quickly |
| Bias term | optional | on / off | Frequently improves accuracy at trivial cost |
Unlike SGD, WALS has no learning rate to tune. The closed-form alternation removes a major source of fragility from the training process, which is one of the reasons it has remained popular over more than fifteen years.
limitations
WALS has well-known limitations that drove the development of richer methods.
- Linear interactions only. The dot product is symmetric and unable to capture complex non-linear interactions between users and items. Neural methods can in principle do better, although the Rendle 2020 results suggest the gap is smaller than it first appeared.
- Cold start. WALS produces a factor vector only for users and items present in the training matrix. New users and new items have no representation until they accumulate enough interactions. Side-information extensions (LightFM, factorization machines, two-tower models) address this.
- Static model. Standard WALS does not model temporal drift in user preferences. The iALS family can be retrained periodically (commonly daily or weekly) but does not naturally produce streaming updates between retrains.
- Popularity bias. Including all unobserved pairs with confidence 1 effectively encodes a uniform negative prior, which can over-recommend popular items if not corrected. Inverse propensity weighting and exposure-aware confidence functions help.
- Memory growth. The factor matrices
X and Y grow linearly with the number of users and items. For services with billions of users, hash embedding tricks, quantisation, or feature hashing become necessary.
- Identifiability. Like all matrix factorization, WALS solutions are unique only up to invertible transformations of the latent space, so latent dimensions are not directly comparable across runs.
- Single objective. WALS optimises one loss. Multi-task and multi-objective recommendation (relevance, diversity, fairness, exploration) require additional machinery on top.
- No native side features. Plain WALS uses only the user-item matrix. To incorporate item categories, user demographics, or contextual signals, the model needs to be combined with factorization machines or replaced by a two-tower / DLRM-style architecture.
Despite these limitations, WALS continues to be deployed in production at companies such as Spotify, Netflix, YouTube, Etsy, and Quora, often as the candidate generator inside a larger neural ranking system. It is fast to train, easy to tune, robust to noise, and produces embeddings that integrate cleanly with approximate nearest neighbour search.
relationship to other concepts
WALS sits at the intersection of several key ideas in recommendation and representation learning.
- It is a special case of weighted matrix factorization, where the weights happen to be derived from a confidence function.
- It generalises truncated SVD to handle non-uniform observation noise.
- It is closely related to Logistic Matrix Factorization, which uses a binary cross-entropy loss instead of squared error but otherwise has a very similar structure.
- The user and item factor matrices can be viewed as embeddings, the same construct used in deep learning models. Two-tower retrieval models, in their simplest dot-product form, reduce to WALS-style matrix factorization with neural side feature processing on each tower.
- The decomposition
Y^T C^u Y = Y^T Y + Y^T (C^u - I) Y is a special case of the Sherman-Morrison-Woodbury identity in numerical linear algebra.
- The 2014 Levy and Goldberg result that Word2Vec is implicit matrix factorization parallels the WALS formulation: both methods learn embeddings by approximating a co-occurrence (or interaction) matrix with a low-rank structure under a non-uniform weighting scheme.
see also
references
- Hu, Y., Koren, Y., & Volinsky, C. (2008). "Collaborative Filtering for Implicit Feedback Datasets." Proceedings of the 8th IEEE International Conference on Data Mining (ICDM), 263-272. http://yifanhu.net/PUB/cf.pdf
- Koren, Y., Bell, R., & Volinsky, C. (2009). "Matrix Factorization Techniques for Recommender Systems." IEEE Computer, 42(8), 30-37. https://ieeexplore.ieee.org/document/5197422
- Koren, Y. (2008). "Factorization Meets the Neighborhood: A Multifaceted Collaborative Filtering Model." Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 426-434.
- Pan, R., Zhou, Y., Cao, B., Liu, N. N., Lukose, R., Scholz, M., & Yang, Q. (2008). "One-Class Collaborative Filtering." Proceedings of the 8th IEEE International Conference on Data Mining (ICDM), 502-511. https://ieeexplore.ieee.org/document/4781145
- Rendle, S., Freudenthaler, C., Gantner, Z., & Schmidt-Thieme, L. (2009). "BPR: Bayesian Personalized Ranking from Implicit Feedback." Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI), 452-461. https://arxiv.org/abs/1205.2618
- Rendle, S., Krichene, W., Zhang, L., & Anderson, J. (2020). "Neural Collaborative Filtering vs. Matrix Factorization Revisited." Proceedings of the 14th ACM Conference on Recommender Systems (RecSys). https://arxiv.org/abs/2005.09683
- Rendle, S., Krichene, W., Zhang, L., & Anderson, J. (2022). "Revisiting the Performance of iALS on Item Recommendation Benchmarks." Proceedings of the 16th ACM Conference on Recommender Systems (RecSys). https://arxiv.org/abs/2110.14037
- Johnson, C. C. (2014). "Logistic Matrix Factorization for Implicit Feedback Data." NeurIPS Workshop on Distributed Machine Learning and Matrix Computations. https://web.stanford.edu/~rezab/nips2014workshop/submits/logmat.pdf
- He, X., Liao, L., Zhang, H., Nie, L., Hu, X., & Chua, T.-S. (2017). "Neural Collaborative Filtering." Proceedings of the 26th International Conference on World Wide Web (WWW), 173-182. https://arxiv.org/abs/1708.05031
- Liang, D., Krishnan, R. G., Hoffman, M. D., & Jebara, T. (2018). "Variational Autoencoders for Collaborative Filtering." Proceedings of the World Wide Web Conference (WWW). https://arxiv.org/abs/1802.05814
- Steck, H. (2019). "Embarrassingly Shallow Autoencoders for Sparse Data." Proceedings of the World Wide Web Conference (WWW), 3251-3257. https://arxiv.org/abs/1905.03375
- He, X., Deng, K., Wang, X., Li, Y., Zhang, Y., & Wang, M. (2020). "LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation." Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. https://arxiv.org/abs/2002.02126
- Naumov, M., Mudigere, D., Shi, H. M., et al. (2019). "Deep Learning Recommendation Model for Personalization and Recommendation Systems." Meta AI technical report. https://arxiv.org/abs/1906.00091
- Cheng, H.-T., Koc, L., Harmsen, J., et al. (2016). "Wide & Deep Learning for Recommender Systems." Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. https://arxiv.org/abs/1606.07792
- Covington, P., Adams, J., & Sargin, E. (2016). "Deep Neural Networks for YouTube Recommendations." Proceedings of the 10th ACM Conference on Recommender Systems (RecSys), 191-198. https://research.google/pubs/pub45530/
- Bernhardsson, E. (2014). "Music recommendations at Spotify." Engineering blog post and conference talks. https://erikbern.com/2014/02/01/how-music-recommendations-work-and-dont-work.html
- Spark MLlib documentation. "Collaborative Filtering with Alternating Least Squares." Apache Software Foundation. https://spark.apache.org/docs/latest/ml-collaborative-filtering.html
- Frederickson, B. (2017). "Implicit: Fast Python Collaborative Filtering for Implicit Datasets." GitHub repository. https://github.com/benfred/implicit
- Quora Engineering. (2016). "QMF: A Matrix Factorization Library." GitHub repository. https://github.com/quora/qmf
- TensorFlow contributors. "WALSMatrixFactorization." TensorFlow Estimator API documentation (deprecated in TF 2.x in favour of TensorFlow Recommenders). https://www.tensorflow.org/recommenders
- IEEE ICDM. (2017). "ICDM 2017 Awards: Ten Year Highest-Impact Paper Award presented to Yifan Hu, Yehuda Koren, and Chris Volinsky for Collaborative Filtering for Implicit Feedback Datasets." https://www.cs.uvm.edu/~icdm/awards.shtml
- Levy, O., & Goldberg, Y. (2014). "Neural Word Embedding as Implicit Matrix Factorization." Advances in Neural Information Processing Systems (NIPS), 27. https://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization
- Salakhutdinov, R., & Mnih, A. (2008). "Bayesian Probabilistic Matrix Factorization using Markov Chain Monte Carlo." Proceedings of the 25th International Conference on Machine Learning (ICML). https://www.cs.toronto.edu/~amnih/papers/bpmf.pdf
- Mnih, A., & Salakhutdinov, R. (2007). "Probabilistic Matrix Factorization." Advances in Neural Information Processing Systems (NIPS), 20. https://papers.nips.cc/paper/2007/hash/d7322ed717dedf1eb4e6e52a37ea7bcd-Abstract.html
- Funk, S. (2006). "Netflix Update: Try This at Home." Blog post, December 11, 2006. https://sifter.org/simon/journal/20061211.html