Weighted Alternating Least Squares (WALS)

Weighted Alternating Least Squares (WALS), also called implicit Alternating Least Squares (iALS) or weighted regularized matrix factorization (WRMF), is a matrix factorization algorithm for collaborative filtering on implicit feedback data such as clicks, views, purchases, and play counts. It was introduced by Yifan Hu, Yehuda Koren, and Chris Volinsky at AT&T Labs in their 2008 paper "Collaborative Filtering for Implicit Feedback Datasets," presented at the IEEE International Conference on Data Mining (ICDM 2008). The same paper later won the IEEE ICDM Ten Year Highest Impact Paper Award in 2017, recognizing its central role in industrial recommendation systems.

WALS extends standard Alternating Least Squares by attaching a per-entry confidence weight c_ui to every user-item pair, then minimizing a weighted regularized squared error over the entire user-item matrix rather than only over observed ratings. Because the loss is a sum of weighted squared errors with a quadratic regularization term, the optimal user vector for any given item factor matrix (and vice versa) has a closed-form least squares solution. The algorithm alternates between solving for user vectors with item vectors fixed and solving for item vectors with user vectors fixed until convergence. A key algebraic trick in the original paper, sometimes called the "iALS efficiency reformulation," reduces the per-iteration cost from quadratic in the number of users and items to linear in the number of non-zero observations plus a single Gram-matrix update per iteration. This makes WALS practical on industrial data with hundreds of millions of users and tens of millions of items.

explain like I'm 5 (ELI5)

Imagine an ice cream shop that wants to know which flavors each customer would love. Some customers fill out star ratings ("I give chocolate 5 stars"). That is easy. But most customers do not bother with stars. They just walk in, point at a flavor, and walk out. The shop sees what they ordered, not what they thought of it. That is implicit feedback.

WALS is a way for the shop to guess flavors for everyone, even when nobody filled out star ratings. It treats every flavor and every customer as a small list of secret tastes, like "likes nutty," "likes fruity," "likes creamy." The algorithm makes those lists up at random, then keeps adjusting them so that ordered flavors get high scores and never-ordered flavors get low scores. To make sure the never-ordered ones do not drown out the few flavors a customer actually picked, the algorithm gives ordered flavors a much louder voice (a higher confidence weight). It flips back and forth: first it adjusts customer lists with flavor lists held still, then flavor lists with customer lists held still, and so on, until the guesses stop changing.

The finished lists let the shop predict a score for every customer-flavor pair, even ones nobody has ever tried. The highest scoring flavors for each customer become recommendations.

history

Before 2008, most public collaborative filtering research focused on explicit ratings, especially the five-star scale used by Netflix. The Netflix Prize (2006-2009) trained an entire generation of researchers on dense rating matrices with about 100 million observed ratings on a scale of 1 to 5. Algorithms such as biased SVD, SVD++, and Funk SVD optimized squared error over only the observed entries.

Industrial recommender systems faced a different problem. Television set-top boxes, music streaming services, web stores, and video sites collected huge volumes of behavioral data (channel tunes, song plays, purchases, page views) but very few star ratings. The available signal was implicit, noisy, and asymmetric: an interaction probably indicated some interest, but the absence of an interaction could mean dislike, ignorance, or simply that the user had never been exposed to the item. Treating non-interactions as zero ratings biased the model toward the most popular items, while ignoring them entirely (as in Funk SVD) discarded the bulk of the available signal.

Yifan Hu, Yehuda Koren, and Chris Volinsky proposed a clean solution in 2008. Working with set-top box television viewing data at AT&T Labs Research, they reformulated matrix factorization for implicit feedback in three steps:

Replace the raw observation r_ui with a binary preference variable p_ui (1 if any interaction occurred, 0 otherwise).
Attach a confidence weight c_ui that grows with the strength of the observed signal (typical form c_ui = 1 + alpha * r_ui).
Sum the weighted squared error over every user-item pair, then optimize with alternating least squares.

Because every pair contributes to the loss, the resulting factor vectors balance positive and negative signals naturally. The algebraic reformulation in the paper (often abbreviated as the "add-then-subtract" trick) keeps the per-iteration cost manageable: instead of iterating over m * n user-item pairs, the algorithm iterates over the nnz(R) observed pairs plus a small fixed number of operations involving an f x f Gram matrix, where f is the latent dimension.

The paper appeared at IEEE ICDM 2008 in Pisa, Italy, won the conference's Best Paper award that year, and went on to receive the ICDM 10-Year Highest-Impact Paper Award in 2017. The technique became known by several names. Hu, Koren, and Volinsky called it simply "the implicit feedback model." The Mahout project labelled it "WRMF" (weighted regularized matrix factorization). The Apache Spark MLlib team and TensorFlow's contributors called it "implicit ALS" or "WALS" (weighted ALS). All four names refer to the same family of algorithms.

mathematical formulation

Let R be a sparse m x n interaction matrix where m is the number of users, n is the number of items, and the entry r_ui records the strength of user u's interaction with item i (a play count, a click count, a watch fraction, or simply 1 for any interaction). WALS introduces two derived quantities:

A binary preference p_ui = 1 if r_ui > 0, else 0. The preference encodes whether any interaction occurred.
A confidence weight c_ui that grows with r_ui. The original paper uses two functional forms:
- Linear: c_ui = 1 + alpha * r_ui
- Logarithmic: c_ui = 1 + alpha * log(1 + r_ui / epsilon)

A value alpha = 40 is the well-known starting point from the paper, and epsilon is a small constant (1e-8 is common for the logarithmic form). Even a single observation gives a confidence above the baseline of 1 that all pairs share, so observed entries always have a stronger pull than unobserved ones.

objective function

WALS finds an m x f user factor matrix X and an n x f item factor matrix Y that minimize:

L(X, Y) = sum_{u, i} c_ui (p_ui - x_u^T y_i)^2 + lambda (sum_u ||x_u||^2 + sum_i ||y_i||^2)

The outer sum runs over all m * n user-item pairs, not just observed ones. Each pair contributes a weighted squared error between the binary preference p_ui and the dot product x_u^T y_i, and a quadratic regularization term keeps the magnitudes of the factor vectors in check. The hyperparameter lambda controls regularization strength and is typically chosen by cross-validation.

alternating updates

With the item matrix Y held fixed, the loss is a quadratic in each user vector x_u independently. Setting the gradient to zero gives the closed-form solution:

x_u = (Y^T C^u Y + lambda I)^{-1} Y^T C^u p(u)

where C^u is the n x n diagonal matrix with entries c_ui for the row of R belonging to user u, and p(u) is the column vector of preferences for that user. Symmetrically, with the user matrix X held fixed, each item vector has the closed-form solution:

y_i = (X^T C^i X + lambda I)^{-1} X^T C^i p(i)

WALS iterates between these two updates until convergence. Each step is provably non-increasing in the loss, which is why the algorithm is so stable in practice and requires almost no learning-rate tuning.

the efficiency trick

A naive implementation of Y^T C^u Y requires O(n * f^2) operations per user, leading to O(m * n * f^2) per outer iteration. That is intractable at industrial scale. The Hu, Koren, and Volinsky paper observes that:

Y^T C^u Y = Y^T Y + Y^T (C^u - I) Y

The first term, Y^T Y, does not depend on u and can be computed once per outer iteration in O(n * f^2) time. The second term, Y^T (C^u - I) Y, only involves rows of Y for items where c_ui > 1, which corresponds exactly to the items user u interacted with. For typical sparse data this is a small subset. The same decomposition applies to Y^T C^u p(u). The net effect is that each user update costs O(n_u * f^2 + f^3) operations, where n_u is the number of items that user u interacted with. Summed over all users, the per-iteration cost becomes O(nnz(R) * f^2 + (m + n) * f^3), which scales linearly with the number of observations.

This decomposition is what makes WALS feasible on real corpora. Without it, the same model would require billions of times more compute on a service with hundreds of millions of users.

prediction

After training, the predicted preference of user u for item i is the dot product:

r_hat_ui = x_u^T y_i

For recommendation, items are ranked by their predicted score for the user, optionally restricted to items the user has not yet interacted with. The dot-product structure also enables fast retrieval: candidate items can be located using approximate nearest neighbour search on the item factor matrix, with libraries such as FAISS or ScaNN providing sub-millisecond top-K retrieval over millions of items.

implicit vs explicit feedback

WALS exists because implicit feedback differs structurally from explicit ratings. The contrast drives the algorithm's design choices.

Aspect	Explicit feedback	Implicit feedback
Example signals	1-5 star ratings, thumbs up/down	Clicks, page views, purchases, plays, watch time
User intent	Direct preference statement	Behavioural proxy for preference
Coverage	Sparse; few users rate explicitly	Plentiful; every interaction is logged
Noise level	Lower (users opted in to rate)	Higher (a click is not always interest)
Missing data interpretation	Unknown rating	Possibly negative, possibly unobserved
Negative examples	Low star ratings	None directly available
Range	Bounded (e.g., 1 to 5)	Unbounded counts
Loss target	Predict the rating	Predict the preference, weighted by confidence
Sum over	Observed entries only	All user-item pairs (with weights)
Standard algorithm	Funk SVD, biased SVD, SVD++	WALS / iALS, BPR-MF
Canonical paper	Koren, Bell, Volinsky 2009 (IEEE Computer)	Hu, Koren, Volinsky 2008 (ICDM)

The choice between explicit and implicit feedback is not binary in practice. Many production systems blend both. SVD++ and timeSVD++, for example, augment explicit rating models with implicit "the user looked at this" signals. WALS sits firmly on the implicit side and is the workhorse for systems where explicit ratings are rare or absent.

the WALS algorithm step by step

The pseudocode below summarises the canonical WALS algorithm.

Inputs. A sparse m x n interaction matrix R with values r_ui >= 0, latent dimension f, regularization lambda, confidence weight scale alpha, number of iterations T.
Build derived quantities. For each non-zero r_ui, compute the preference p_ui = 1 and the confidence c_ui = 1 + alpha * r_ui. All other pairs have p_ui = 0 and c_ui = 1.
Initialise X and Y with small random values (commonly Gaussian with standard deviation 1/sqrt(f)).
For t = 1, 2, ..., T do: a. Compute the global Gram matrix G_y = Y^T Y once. b. For each user u, compute A_u = G_y + Y_u^T diag(c_ui - 1) Y_u + lambda I and b_u = Y_u^T diag(c_ui) p(u), where Y_u contains only the rows of Y for items user u interacted with. Solve A_u x_u = b_u to update x_u. c. Compute the global Gram matrix G_x = X^T X once. d. For each item i, perform the symmetric update for y_i.
Stop when the loss stops improving by more than a tolerance, or after T iterations.
Predict r_hat_ui = x_u^T y_i. Rank items per user for recommendations.

A few practical refinements appear in most production-grade implementations:

Item biases are sometimes added by extending each user vector with a constant 1 and each item vector with a free bias coordinate.
Frequency-aware regularization scales lambda by the number of interactions the user or item has, as in Rendle, Krichene, Zhang, and Anderson's 2022 "Revisiting iALS" paper, which showed that careful regularization tuning lets iALS match deep collaborative filtering baselines.
Conjugate gradient can replace the direct linear solve when f is large, trading a slightly higher per-iteration cost for fewer outer iterations and lower memory.
Negative sampling is sometimes added on top of the dense unobserved loss to focus the model on "hard negatives" rather than treating all unobserved items equally.

confidence weighting strategies

The confidence function c_ui is the lever that distinguishes WALS from naive matrix factorization on a 0/1 matrix. Different application domains favour different functional forms.

Strategy	Formula	When to use
Constant	c_ui = 1 + alpha for all observed	Pure implicit binary signal (visited / did not visit)
Linear	c_ui = 1 + alpha * r_ui	Counts grow modestly; the original Hu/Koren/Volinsky default
Logarithmic	c_ui = 1 + alpha * log(1 + r_ui / epsilon)	Heavy-tailed counts; compresses the dynamic range
Square root	c_ui = 1 + alpha * sqrt(r_ui)	Balance between linear and log
Exposure-aware	c_ui depends on whether the item was actually shown	When impression logs are available
Time-decayed	c_ui = 1 + alpha * sum_t r_uti * decay(t)	Streaming services where recency matters

A larger alpha puts more weight on observed pairs and effectively makes the algorithm fit positive signals harder. A smaller alpha makes the algorithm closer to a uniform-weight low-rank approximation of the binary preference matrix. In practice alpha is tuned by cross-validation on a held-out set.

implementations

WALS is widely implemented across the open source ecosystem. The table below summarises the most-used libraries and frameworks.

Library / framework	Language	API / class	Notes
Apache Spark MLlib	Scala / Python / Java	`pyspark.ml.recommendation.ALS` with `implicitPrefs=True`	Distributed implementation; partitions users and items across cluster
implicit	Python (Cython + CUDA)	`implicit.als.AlternatingLeastSquares`	Single-machine; CPU and GPU; the de facto standard for medium-scale iALS
TensorFlow contrib (deprecated)	Python	`tf.contrib.factorization.WALSMatrixFactorization`	Native TensorFlow implementation; used at Google before Keras-era recommenders
TensorFlow Recommenders	Python	Two-tower retrieval	Modern Google recommendations stack; dot-product MF as a special case
Apache Mahout	Java	`ParallelALSFactorizationJob` ("WRMF" mode)	One of the earliest open implementations; Hadoop-based
Surprise	Python	(No native iALS; supports SVD, NMF, KNN-based)	Mostly explicit-feedback algorithms
LensKit	Python	`lenskit.algorithms.als.ImplicitMF`	Research-oriented; reproducible recommender benchmarks
LightFM	Python	(Hybrid; uses WARP / BPR loss rather than WALS)	Often compared against iALS as a baseline
Microsoft Recommenders	Python	Wraps Spark's ALS for implicit feedback	Includes notebooks reproducing benchmark results
QMF	C++	Quora's matrix factorization library	Fast multithreaded WRMF

Benfred Frederickson's implicit library deserves special mention. Its CUDA kernels can train a 100-factor iALS model on the Last.fm 1B dataset in minutes on a single GPU, an order of magnitude faster than the Spark implementation on a comparable cluster. The library also exposes BPR and Logistic Matrix Factorization, allowing direct comparison between WALS and ranking-based alternatives.

use cases

WALS has been applied across virtually every domain where implicit feedback dominates the available signal.

streaming media

Music streaming services were early adopters. Spotify documented in a 2014 blog series by Erik Bernhardsson that its core collaborative filtering pipeline was built on iALS over user listening histories, with item embeddings then indexed by Annoy for approximate nearest neighbour search. Bernhardsson's open-source Annoy library was originally developed at Spotify partly to serve those WALS embeddings for retrieval. Even after Spotify moved much of its retrieval to two-tower neural models in the late 2010s, iALS remained a strong production baseline and an honest benchmark for new architectures.

video recommendation

The original Hu, Koren, and Volinsky paper used TV viewing data from set-top boxes: implicit signals from a US cable provider with millions of subscribers and tens of thousands of channels and shows. YouTube has published several recommender architectures over the years, including a 2016 paper ("Deep Neural Networks for YouTube Recommendations") that explicitly contrasts its deep retrieval model against matrix factorization baselines of the WALS family.

e-commerce

Large online stores use WALS over implicit purchase, view, and add-to-cart signals. Etsy described its use of an iALS-style model in a 2014 engineering blog post for personalised recommendations on its homepage, and Quora open-sourced its QMF library specifically to scale WRMF to its question-answering platform. The technique is also a standard in the AWS Personalize service's repertoire of off-the-shelf algorithms for implicit feedback.

advertising and content ranking

Click-through rate prediction systems often start with a WALS-style embedding learned over historical clicks, then enrich it with side features through models such as factorization machines, DLRM, or deep & cross networks. Even when downstream rankers are deep, the candidate generation step is frequently a dot-product retrieval over WALS-trained embeddings.

personalised search

Query-document interaction matrices can be factorized with WALS to produce query and document embeddings used as additional signals in learning-to-rank pipelines. Microsoft, Bing, and Yahoo have all published papers using iALS-style techniques as feature generators or candidate retrievers in personalised search.

scientific applications

Beyond classic recommendations, WALS appears in several scientific domains. Genomics studies use it to recover patient-mutation associations from sparse count matrices. Network analysis applies WALS to citation and collaboration graphs. Educational technology platforms use it to recommend exercises by treating student-question interactions as implicit feedback.

comparison with other matrix factorization methods

WALS is one of several matrix factorization algorithms. The choice depends on the data type, the loss function, and the engineering constraints.

Method	Loss	Feedback	Optimization	Strengths	Weaknesses
Truncated SVD	Frobenius reconstruction	Fully observed dense matrix	Lanczos / randomized SVD	Optimal under squared error; closed form	Cannot handle missing data natively; treats zeros as observations
Funk SVD	Squared error on observed	Explicit	SGD	Simple; fast on sparse explicit data	Ignores unobserved pairs; weak on implicit feedback
Biased SVD	Squared error with biases	Explicit	SGD or ALS	Captures user/item bias; very strong on Netflix-style data	Same weakness on implicit feedback
SVD++	Squared error with implicit term	Mixed	SGD	Combines explicit ratings with implicit signals	More expensive than basic SVD
NMF	KL or Frobenius	Counts (non-negative)	Multiplicative updates / ANLS	Interpretable parts; non-negative factors	No native confidence weighting for implicit
WALS / iALS	Weighted squared error over all pairs	Implicit	Alternating closed-form least squares	Industrial-scale implicit feedback; embeds well for ANN	Trains a global model; not personalised online
BPR-MF	Pairwise ranking (logistic)	Implicit	SGD over (user, pos, neg) triples	Optimises ranking directly; strong top-K performance	Triplet sampling can be slow; less stable than WALS
Logistic Matrix Factorization	Binary cross-entropy weighted	Implicit	SGD or ALS variants	Probabilistic interpretation; calibrated outputs	Often comparable to iALS but more hyperparameters
Probabilistic Matrix Factorization (PMF)	Gaussian likelihood + Gaussian prior	Explicit	MAP via SGD	Bayesian interpretation of regularization	Not designed for implicit feedback
Bayesian PMF	Same as PMF with hyperpriors	Explicit	MCMC	Automatic hyperparameter selection	Higher computational cost
Factorization Machines	Squared / logistic with side features	Either	SGD / ALS / MCMC	Handles arbitrary features	Loses some scalability benefits of pure MF
CCD++	Squared error rank-one updates	Either	Cyclic coordinate descent	Very fast on multi-core CPUs	Less common in modern toolchains
Deep Matrix Factorization	Reconstruction with neural transforms	Either	SGD	Captures non-linearities	Often beaten by tuned iALS in benchmarks

WALS vs SVD

Classical truncated SVD operates on a fully observed dense matrix and assumes uniform error variance. Applying it to an implicit feedback matrix means treating unobserved pairs as hard zeros, which biases the decomposition toward popular items. WALS keeps the dense formulation but downweights unobserved pairs by setting their confidence to 1 while observed pairs get confidence 1 + alpha * r_ui. This single change converts SVD-style decomposition into a method that respects the asymmetry of implicit data.

WALS vs Funk SVD

Funk SVD only computes loss over observed entries, which works well for explicit ratings but discards a key signal in implicit data: the absence of an interaction. WALS includes every pair in the loss with a small but non-zero weight, allowing the model to push apart observed and unobserved items in latent space.

WALS vs BPR

Both WALS and Bayesian Personalized Ranking (BPR) target implicit feedback, but they optimize different objectives. WALS minimizes weighted squared error over all pairs; BPR minimizes a pairwise ranking loss between observed and unobserved items. Empirically, BPR can produce sharper top-K rankings on dense datasets, while WALS is more stable, easier to parallelise, and converges in fewer iterations. The 2022 "Revisiting the Performance of iALS" paper by Rendle, Krichene, Zhang, and Anderson showed that with careful tuning of regularization, sampling, and frequency weighting, iALS matches or exceeds BPR on standard benchmarks (MovieLens, Pinterest, Million Songs).

WALS vs NMF

Non-negative matrix factorization constrains all factor entries to be non-negative, producing parts-based representations that are often interpretable. WALS has no sign constraint and is optimised for prediction quality on implicit data rather than interpretability. NMF can be adapted to implicit feedback (see Hu, Koren, and Volinsky's discussion of weighted NMF), but the WALS formulation is usually faster and easier to implement.

modern alternatives

Deep learning and self-supervised methods have produced several alternatives to WALS for implicit feedback recommendation.

Method	Year	Architecture	Pros vs WALS	Cons vs WALS
Neural Collaborative Filtering (NCF)	2017	Two-branch MLP combining GMF and MLP	Captures non-linearities; flexible inputs	Slower; Rendle 2020 showed properly tuned WALS often matches it
Variational Autoencoder for CF (Mult-VAE)	2018	Encoder-decoder with multinomial likelihood	Strong on dense datasets; principled probabilistic loss	Larger memory footprint; harder retrieval
Two-tower retrieval	~2016+	Separate user and item encoders	Native support for side features	Higher engineering complexity
Deep Learning Recommendation Model (DLRM)	2019 (Meta)	Embedding tables + MLP + feature interactions	Native handling of categorical and numeric features	Designed for ranking, not retrieval
Wide & Deep	2016 (Google)	Linear wide branch + deep MLP branch	Generalisation plus memorisation	Heavier than WALS; needs feature engineering
SASRec	2018	Self-attention over user history	Captures sequential preferences	Requires user histories; offline retraining
BERT4Rec	2019	Bidirectional transformer	Stronger sequence modelling	Even higher compute cost
GRU4Rec	2016	Gated recurrent network over sessions	Good for session-based recommendation	Less of a general retrieval model
EASE / EASE^R	2019	Item-item linear autoencoder	Closed-form; sometimes beats neural baselines	O(n^2) memory in items
LightGCN	2020	Linearised graph convolution on user-item graph	Strong on benchmarks; conceptually simple	Graph propagation cost

A recurring theme in this list is that carefully tuned WALS remains a competitive baseline. Steffen Rendle, Walid Krichene, Li Zhang, and John Anderson's 2020 RecSys paper "Neural Collaborative Filtering vs. Matrix Factorization Revisited" reproduced the original NCF experiments and showed that a properly hyperparameter-tuned dot-product matrix factorization outperforms the MLP-based learned similarity. Their 2022 follow-up, "Revisiting the Performance of iALS," showed the same for WALS specifically: with appropriate frequency-aware regularization, item bias, and the confidence functional form chosen by cross-validation, iALS matches or beats Mult-VAE, EASE, and LightGCN on the MovieLens-20M, MSD, and Yelp benchmarks.

In industry the picture is similar. Most large recommender pipelines combine WALS-style retrieval with a deep ranker. WALS-trained item embeddings feed an approximate nearest neighbour index (FAISS, ScaNN, Annoy) to generate a few hundred candidates per user in real time; a downstream neural model with side features then re-ranks those candidates. The role of WALS has narrowed from end-to-end recommender to fast candidate generator, but it remains an essential ingredient.

hyperparameters and tuning

The main hyperparameters of WALS are:

Hyperparameter	Symbol	Typical values	Notes
Latent dimension	`f`	32 to 512	Higher gives more capacity but more memory and compute
Regularization	`lambda`	1e-3 to 1e1	Tune by cross-validation; frequency-aware variant scales by user/item interaction count
Confidence weight scale	`alpha`	1 to 100	Larger pulls observed pairs harder; the original paper used 40
Confidence functional form	linear or log	Pick by experiment	Log compresses heavy-tailed counts
Number of iterations	`T`	10 to 30	Loss usually plateaus quickly
Bias term	optional	on / off	Frequently improves accuracy at trivial cost

Unlike SGD, WALS has no learning rate to tune. The closed-form alternation removes a major source of fragility from the training process, which is one of the reasons it has remained popular over more than fifteen years.

limitations

WALS has well-known limitations that drove the development of richer methods.

Linear interactions only. The dot product is symmetric and unable to capture complex non-linear interactions between users and items. Neural methods can in principle do better, although the Rendle 2020 results suggest the gap is smaller than it first appeared.
Cold start. WALS produces a factor vector only for users and items present in the training matrix. New users and new items have no representation until they accumulate enough interactions. Side-information extensions (LightFM, factorization machines, two-tower models) address this.
Static model. Standard WALS does not model temporal drift in user preferences. The iALS family can be retrained periodically (commonly daily or weekly) but does not naturally produce streaming updates between retrains.
Popularity bias. Including all unobserved pairs with confidence 1 effectively encodes a uniform negative prior, which can over-recommend popular items if not corrected. Inverse propensity weighting and exposure-aware confidence functions help.
Memory growth. The factor matrices X and Y grow linearly with the number of users and items. For services with billions of users, hash embedding tricks, quantisation, or feature hashing become necessary.
Identifiability. Like all matrix factorization, WALS solutions are unique only up to invertible transformations of the latent space, so latent dimensions are not directly comparable across runs.
Single objective. WALS optimises one loss. Multi-task and multi-objective recommendation (relevance, diversity, fairness, exploration) require additional machinery on top.
No native side features. Plain WALS uses only the user-item matrix. To incorporate item categories, user demographics, or contextual signals, the model needs to be combined with factorization machines or replaced by a two-tower / DLRM-style architecture.

Despite these limitations, WALS continues to be deployed in production at companies such as Spotify, Netflix, YouTube, Etsy, and Quora, often as the candidate generator inside a larger neural ranking system. It is fast to train, easy to tune, robust to noise, and produces embeddings that integrate cleanly with approximate nearest neighbour search.

relationship to other concepts

WALS sits at the intersection of several key ideas in recommendation and representation learning.

It is a special case of weighted matrix factorization, where the weights happen to be derived from a confidence function.
It generalises truncated SVD to handle non-uniform observation noise.
It is closely related to Logistic Matrix Factorization, which uses a binary cross-entropy loss instead of squared error but otherwise has a very similar structure.
The user and item factor matrices can be viewed as embeddings, the same construct used in deep learning models. Two-tower retrieval models, in their simplest dot-product form, reduce to WALS-style matrix factorization with neural side feature processing on each tower.
The decomposition Y^T C^u Y = Y^T Y + Y^T (C^u - I) Y is a special case of the Sherman-Morrison-Woodbury identity in numerical linear algebra.
The 2014 Levy and Goldberg result that Word2Vec is implicit matrix factorization parallels the WALS formulation: both methods learn embeddings by approximating a co-occurrence (or interaction) matrix with a low-rank structure under a non-uniform weighting scheme.

references

Hu, Y., Koren, Y., & Volinsky, C. (2008). "Collaborative Filtering for Implicit Feedback Datasets." Proceedings of the 8th IEEE International Conference on Data Mining (ICDM), 263-272. http://yifanhu.net/PUB/cf.pdf
Koren, Y., Bell, R., & Volinsky, C. (2009). "Matrix Factorization Techniques for Recommender Systems." IEEE Computer, 42(8), 30-37. https://ieeexplore.ieee.org/document/5197422
Koren, Y. (2008). "Factorization Meets the Neighborhood: A Multifaceted Collaborative Filtering Model." Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 426-434.
Pan, R., Zhou, Y., Cao, B., Liu, N. N., Lukose, R., Scholz, M., & Yang, Q. (2008). "One-Class Collaborative Filtering." Proceedings of the 8th IEEE International Conference on Data Mining (ICDM), 502-511. https://ieeexplore.ieee.org/document/4781145
Rendle, S., Freudenthaler, C., Gantner, Z., & Schmidt-Thieme, L. (2009). "BPR: Bayesian Personalized Ranking from Implicit Feedback." Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI), 452-461. https://arxiv.org/abs/1205.2618
Rendle, S., Krichene, W., Zhang, L., & Anderson, J. (2020). "Neural Collaborative Filtering vs. Matrix Factorization Revisited." Proceedings of the 14th ACM Conference on Recommender Systems (RecSys). https://arxiv.org/abs/2005.09683
Rendle, S., Krichene, W., Zhang, L., & Anderson, J. (2022). "Revisiting the Performance of iALS on Item Recommendation Benchmarks." Proceedings of the 16th ACM Conference on Recommender Systems (RecSys). https://arxiv.org/abs/2110.14037
Johnson, C. C. (2014). "Logistic Matrix Factorization for Implicit Feedback Data." NeurIPS Workshop on Distributed Machine Learning and Matrix Computations. https://web.stanford.edu/~rezab/nips2014workshop/submits/logmat.pdf
He, X., Liao, L., Zhang, H., Nie, L., Hu, X., & Chua, T.-S. (2017). "Neural Collaborative Filtering." Proceedings of the 26th International Conference on World Wide Web (WWW), 173-182. https://arxiv.org/abs/1708.05031
Liang, D., Krishnan, R. G., Hoffman, M. D., & Jebara, T. (2018). "Variational Autoencoders for Collaborative Filtering." Proceedings of the World Wide Web Conference (WWW). https://arxiv.org/abs/1802.05814
Steck, H. (2019). "Embarrassingly Shallow Autoencoders for Sparse Data." Proceedings of the World Wide Web Conference (WWW), 3251-3257. https://arxiv.org/abs/1905.03375
He, X., Deng, K., Wang, X., Li, Y., Zhang, Y., & Wang, M. (2020). "LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation." Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. https://arxiv.org/abs/2002.02126
Naumov, M., Mudigere, D., Shi, H. M., et al. (2019). "Deep Learning Recommendation Model for Personalization and Recommendation Systems." Meta AI technical report. https://arxiv.org/abs/1906.00091
Cheng, H.-T., Koc, L., Harmsen, J., et al. (2016). "Wide & Deep Learning for Recommender Systems." Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. https://arxiv.org/abs/1606.07792
Covington, P., Adams, J., & Sargin, E. (2016). "Deep Neural Networks for YouTube Recommendations." Proceedings of the 10th ACM Conference on Recommender Systems (RecSys), 191-198. https://research.google/pubs/pub45530/
Bernhardsson, E. (2014). "Music recommendations at Spotify." Engineering blog post and conference talks. https://erikbern.com/2014/02/01/how-music-recommendations-work-and-dont-work.html
Spark MLlib documentation. "Collaborative Filtering with Alternating Least Squares." Apache Software Foundation. https://spark.apache.org/docs/latest/ml-collaborative-filtering.html
Frederickson, B. (2017). "Implicit: Fast Python Collaborative Filtering for Implicit Datasets." GitHub repository. https://github.com/benfred/implicit
Quora Engineering. (2016). "QMF: A Matrix Factorization Library." GitHub repository. https://github.com/quora/qmf
TensorFlow contributors. "WALSMatrixFactorization." TensorFlow Estimator API documentation (deprecated in TF 2.x in favour of TensorFlow Recommenders). https://www.tensorflow.org/recommenders
IEEE ICDM. (2017). "ICDM 2017 Awards: Ten Year Highest-Impact Paper Award presented to Yifan Hu, Yehuda Koren, and Chris Volinsky for Collaborative Filtering for Implicit Feedback Datasets." https://www.cs.uvm.edu/~icdm/awards.shtml
Levy, O., & Goldberg, Y. (2014). "Neural Word Embedding as Implicit Matrix Factorization." Advances in Neural Information Processing Systems (NIPS), 27. https://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization
Salakhutdinov, R., & Mnih, A. (2008). "Bayesian Probabilistic Matrix Factorization using Markov Chain Monte Carlo." Proceedings of the 25th International Conference on Machine Learning (ICML). https://www.cs.toronto.edu/~amnih/papers/bpmf.pdf
Mnih, A., & Salakhutdinov, R. (2007). "Probabilistic Matrix Factorization." Advances in Neural Information Processing Systems (NIPS), 20. https://papers.nips.cc/paper/2007/hash/d7322ed717dedf1eb4e6e52a37ea7bcd-Abstract.html
Funk, S. (2006). "Netflix Update: Try This at Home." Blog post, December 11, 2006. https://sifter.org/simon/journal/20061211.html

explain like I'm 5 (ELI5)

history

mathematical formulation

objective function

alternating updates

the efficiency trick

prediction

implicit vs explicit feedback

the WALS algorithm step by step

confidence weighting strategies

implementations

use cases

streaming media

video recommendation

e-commerce

advertising and content ranking

personalised search

scientific applications

comparison with other matrix factorization methods

WALS vs SVD

WALS vs Funk SVD

WALS vs BPR

WALS vs NMF

modern alternatives

hyperparameters and tuning

limitations

relationship to other concepts

see also

references

Improve this article

Related Articles

TikTok

Item matrix

User matrix

Social Media

Tower

Two-Tower Model

explain like I'm 5 (ELI5)

history

mathematical formulation

objective function

alternating updates

the efficiency trick

prediction

implicit vs explicit feedback

the WALS algorithm step by step

confidence weighting strategies

implementations

use cases

streaming media

video recommendation

e-commerce

advertising and content ranking

personalised search

scientific applications

comparison with other matrix factorization methods

WALS vs SVD

WALS vs Funk SVD

WALS vs BPR

WALS vs NMF

modern alternatives

hyperparameters and tuning

limitations

relationship to other concepts

see also

references

Related Articles

TikTok

Item matrix

User matrix

Social Media

Tower

Two-Tower Model