Weighted Alternating Least Squares (WALS)
Last reviewed
Jun 2, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v5 ยท 5,555 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Jun 2, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v5 ยท 5,555 words
Add missing citations, update stale details, or suggest a clearer explanation.
Weighted Alternating Least Squares (WALS), also called implicit Alternating Least Squares (iALS) or weighted regularized matrix factorization (WRMF), is a matrix factorization algorithm for collaborative filtering on implicit feedback data such as clicks, views, purchases, and play counts. It was introduced by Yifan Hu, Yehuda Koren, and Chris Volinsky at AT&T Labs in their 2008 paper "Collaborative Filtering for Implicit Feedback Datasets," presented at the IEEE International Conference on Data Mining (ICDM 2008).[1] The same paper later won the IEEE ICDM Ten Year Highest Impact Paper Award in 2017, recognizing its central role in industrial recommendation systems.[21]
WALS extends standard Alternating Least Squares by attaching a per-entry confidence weight c_ui to every user-item pair, then minimizing a weighted regularized squared error over the entire user-item matrix rather than only over observed ratings. Because the loss is a sum of weighted squared errors with a quadratic regularization term, the optimal user vector for any given item factor matrix (and vice versa) has a closed-form least squares solution. The algorithm alternates between solving for user vectors with item vectors fixed and solving for item vectors with user vectors fixed until convergence. A key algebraic trick in the original paper, sometimes called the "iALS efficiency reformulation," reduces the per-iteration cost from quadratic in the number of users and items to linear in the number of non-zero observations plus a single Gram-matrix update per iteration.[1] This makes WALS practical on industrial data with hundreds of millions of users and tens of millions of items.
Imagine an ice cream shop that wants to know which flavors each customer would love. Some customers fill out star ratings ("I give chocolate 5 stars"). That is easy. But most customers do not bother with stars. They just walk in, point at a flavor, and walk out. The shop sees what they ordered, not what they thought of it. That is implicit feedback.
WALS is a way for the shop to guess flavors for everyone, even when nobody filled out star ratings. It treats every flavor and every customer as a small list of secret tastes, like "likes nutty," "likes fruity," "likes creamy." The algorithm makes those lists up at random, then keeps adjusting them so that ordered flavors get high scores and never-ordered flavors get low scores. To make sure the never-ordered ones do not drown out the few flavors a customer actually picked, the algorithm gives ordered flavors a much louder voice (a higher confidence weight). It flips back and forth: first it adjusts customer lists with flavor lists held still, then flavor lists with customer lists held still, and so on, until the guesses stop changing.
The finished lists let the shop predict a score for every customer-flavor pair, even ones nobody has ever tried. The highest scoring flavors for each customer become recommendations.
Before 2008, most public collaborative filtering research focused on explicit ratings, especially the five-star scale used by Netflix. The Netflix Prize (2006-2009) trained an entire generation of researchers on dense rating matrices with about 100 million observed ratings on a scale of 1 to 5. Algorithms such as biased SVD,[2] SVD++,[3] and Funk SVD[25] optimized squared error over only the observed entries.
Industrial recommender systems faced a different problem. Television set-top boxes, music streaming services, web stores, and video sites collected huge volumes of behavioral data (channel tunes, song plays, purchases, page views) but very few star ratings. The available signal was implicit, noisy, and asymmetric: an interaction probably indicated some interest, but the absence of an interaction could mean dislike, ignorance, or simply that the user had never been exposed to the item. Treating non-interactions as zero ratings biased the model toward the most popular items, while ignoring them entirely (as in Funk SVD) discarded the bulk of the available signal.[4]
Yifan Hu, Yehuda Koren, and Chris Volinsky proposed a clean solution in 2008. Working with set-top box television viewing data at AT&T Labs Research, they reformulated matrix factorization for implicit feedback in three steps:[1]
r_ui with a binary preference variable p_ui (1 if any interaction occurred, 0 otherwise).c_ui that grows with the strength of the observed signal (typical form c_ui = 1 + alpha * r_ui).Because every pair contributes to the loss, the resulting factor vectors balance positive and negative signals naturally. The algebraic reformulation in the paper (often abbreviated as the "add-then-subtract" trick) keeps the per-iteration cost manageable: instead of iterating over m * n user-item pairs, the algorithm iterates over the nnz(R) observed pairs plus a small fixed number of operations involving an f x f Gram matrix, where f is the latent dimension.
The paper appeared at IEEE ICDM 2008 in Pisa, Italy, won the conference's Best Paper award that year, and went on to receive the ICDM 10-Year Highest-Impact Paper Award in 2017.[21] The technique became known by several names. Hu, Koren, and Volinsky called it simply "the implicit feedback model." The Mahout project labelled it "WRMF" (weighted regularized matrix factorization). The Apache Spark MLlib team and TensorFlow's contributors called it "implicit ALS" or "WALS" (weighted ALS).[20] All four names refer to the same family of algorithms.
Let R be a sparse m x n interaction matrix where m is the number of users, n is the number of items, and the entry r_ui records the strength of user u's interaction with item i (a play count, a click count, a watch fraction, or simply 1 for any interaction). WALS introduces two derived quantities:
p_ui = 1 if r_ui > 0, else 0. The preference encodes whether any interaction occurred.c_ui that grows with r_ui. The original paper uses two functional forms:
c_ui = 1 + alpha * r_uic_ui = 1 + alpha * log(1 + r_ui / epsilon)A value alpha = 40 is the well-known starting point from the paper, and epsilon is a small constant (1e-8 is common for the logarithmic form).[1] Even a single observation gives a confidence above the baseline of 1 that all pairs share, so observed entries always have a stronger pull than unobserved ones.
WALS finds an m x f user factor matrix X and an n x f item factor matrix Y that minimize:
L(X, Y) = sum_{u, i} c_ui (p_ui - x_u^T y_i)^2 + lambda (sum_u ||x_u||^2 + sum_i ||y_i||^2)
The outer sum runs over all m * n user-item pairs, not just observed ones. Each pair contributes a weighted squared error between the binary preference p_ui and the dot product x_u^T y_i, and a quadratic regularization term keeps the magnitudes of the factor vectors in check. The hyperparameter lambda controls regularization strength and is typically chosen by cross-validation.
With the item matrix Y held fixed, the loss is a quadratic in each user vector x_u independently. Setting the gradient to zero gives the closed-form solution:
x_u = (Y^T C^u Y + lambda I)^{-1} Y^T C^u p(u)
where C^u is the n x n diagonal matrix with entries c_ui for the row of R belonging to user u, and p(u) is the column vector of preferences for that user. Symmetrically, with the user matrix X held fixed, each item vector has the closed-form solution:
y_i = (X^T C^i X + lambda I)^{-1} X^T C^i p(i)
WALS iterates between these two updates until convergence. Each step is provably non-increasing in the loss, which is why the algorithm is so stable in practice and requires almost no learning-rate tuning.
A naive implementation of Y^T C^u Y requires O(n * f^2) operations per user, leading to O(m * n * f^2) per outer iteration. That is intractable at industrial scale. The Hu, Koren, and Volinsky paper observes that:[1]
Y^T C^u Y = Y^T Y + Y^T (C^u - I) Y
The first term, Y^T Y, does not depend on u and can be computed once per outer iteration in O(n * f^2) time. The second term, Y^T (C^u - I) Y, only involves rows of Y for items where c_ui > 1, which corresponds exactly to the items user u interacted with. For typical sparse data this is a small subset. The same decomposition applies to Y^T C^u p(u). The net effect is that each user update costs O(n_u * f^2 + f^3) operations, where n_u is the number of items that user u interacted with. Summed over all users, the per-iteration cost becomes O(nnz(R) * f^2 + (m + n) * f^3), which scales linearly with the number of observations.
This decomposition is what makes WALS feasible on real corpora. Without it, the same model would require billions of times more compute on a service with hundreds of millions of users.
After training, the predicted preference of user u for item i is the dot product:
r_hat_ui = x_u^T y_i
For recommendation, items are ranked by their predicted score for the user, optionally restricted to items the user has not yet interacted with. The dot-product structure also enables fast retrieval: candidate items can be located using approximate nearest neighbour search on the item factor matrix, with libraries such as FAISS or ScaNN providing sub-millisecond top-K retrieval over millions of items.
WALS exists because implicit feedback differs structurally from explicit ratings. The contrast drives the algorithm's design choices.
| Aspect | Explicit feedback | Implicit feedback |
|---|---|---|
| Example signals | 1-5 star ratings, thumbs up/down | Clicks, page views, purchases, plays, watch time |
| User intent | Direct preference statement | Behavioural proxy for preference |
| Coverage | Sparse; few users rate explicitly | Plentiful; every interaction is logged |
| Noise level | Lower (users opted in to rate) | Higher (a click is not always interest) |
| Missing data interpretation | Unknown rating | Possibly negative, possibly unobserved |
| Negative examples | Low star ratings | None directly available |
| Range | Bounded (e.g., 1 to 5) | Unbounded counts |
| Loss target | Predict the rating | Predict the preference, weighted by confidence |
| Sum over | Observed entries only | All user-item pairs (with weights) |
| Standard algorithm | Funk SVD, biased SVD, SVD++ | WALS / iALS, BPR-MF |
| Canonical paper | Koren, Bell, Volinsky 2009 (IEEE Computer) | Hu, Koren, Volinsky 2008 (ICDM) |
The choice between explicit and implicit feedback is not binary in practice. Many production systems blend both. SVD++ and timeSVD++, for example, augment explicit rating models with implicit "the user looked at this" signals.[3] WALS sits firmly on the implicit side and is the workhorse for systems where explicit ratings are rare or absent.
The pseudocode below summarises the canonical WALS algorithm.
m x n interaction matrix R with values r_ui >= 0, latent dimension f, regularization lambda, confidence weight scale alpha, number of iterations T.r_ui, compute the preference p_ui = 1 and the confidence c_ui = 1 + alpha * r_ui. All other pairs have p_ui = 0 and c_ui = 1.X and Y with small random values (commonly Gaussian with standard deviation 1/sqrt(f)).t = 1, 2, ..., T do:
a. Compute the global Gram matrix G_y = Y^T Y once.
b. For each user u, compute A_u = G_y + Y_u^T diag(c_ui - 1) Y_u + lambda I and b_u = Y_u^T diag(c_ui) p(u), where Y_u contains only the rows of Y for items user u interacted with. Solve A_u x_u = b_u to update x_u.
c. Compute the global Gram matrix G_x = X^T X once.
d. For each item i, perform the symmetric update for y_i.T iterations.r_hat_ui = x_u^T y_i. Rank items per user for recommendations.A few practical refinements appear in most production-grade implementations:
1 and each item vector with a free bias coordinate.lambda by the number of interactions the user or item has, as in Rendle, Krichene, Zhang, and Anderson's 2022 "Revisiting iALS" paper, which showed that careful regularization tuning lets iALS match deep collaborative filtering baselines.[7]f is large, trading a slightly higher per-iteration cost for fewer outer iterations and lower memory.The confidence function c_ui is the lever that distinguishes WALS from naive matrix factorization on a 0/1 matrix. Different application domains favour different functional forms.
| Strategy | Formula | When to use |
|---|---|---|
| Constant | c_ui = 1 + alpha for all observed | Pure implicit binary signal (visited / did not visit) |
| Linear | c_ui = 1 + alpha * r_ui | Counts grow modestly; the original Hu/Koren/Volinsky default |
| Logarithmic | c_ui = 1 + alpha * log(1 + r_ui / epsilon) | Heavy-tailed counts; compresses the dynamic range |
| Square root | c_ui = 1 + alpha * sqrt(r_ui) | Balance between linear and log |
| Exposure-aware | c_ui depends on whether the item was actually shown | When impression logs are available |
| Time-decayed | c_ui = 1 + alpha * sum_t r_uti * decay(t) | Streaming services where recency matters |
A larger alpha puts more weight on observed pairs and effectively makes the algorithm fit positive signals harder. A smaller alpha makes the algorithm closer to a uniform-weight low-rank approximation of the binary preference matrix. In practice alpha is tuned by cross-validation on a held-out set.
WALS is widely implemented across the open source ecosystem. The table below summarises the most-used libraries and frameworks.
| Library / framework | Language | API / class | Notes |
|---|---|---|---|
| Apache Spark MLlib | Scala / Python / Java | pyspark.ml.recommendation.ALS with implicitPrefs=True | Distributed implementation; partitions users and items across cluster |
| implicit | Python (Cython + CUDA) | implicit.als.AlternatingLeastSquares | Single-machine; CPU and GPU; the de facto standard for medium-scale iALS |
| TensorFlow contrib (deprecated) | Python | tf.contrib.factorization.WALSMatrixFactorization | Native TensorFlow implementation; used at Google before Keras-era recommenders |
| TensorFlow Recommenders | Python | Two-tower retrieval | Modern Google recommendations stack; dot-product MF as a special case |
| Apache Mahout | Java | ParallelALSFactorizationJob ("WRMF" mode) | One of the earliest open implementations; Hadoop-based |
| Surprise | Python | (No native iALS; supports SVD, NMF, KNN-based) | Mostly explicit-feedback algorithms |
| LensKit | Python | lenskit.algorithms.als.ImplicitMF | Research-oriented; reproducible recommender benchmarks |
| LightFM | Python | (Hybrid; uses WARP / BPR loss rather than WALS) | Often compared against iALS as a baseline |
| Microsoft Recommenders | Python | Wraps Spark's ALS for implicit feedback | Includes notebooks reproducing benchmark results |
| QMF | C++ | Quora's matrix factorization library | Fast multithreaded WRMF |
Benfred Frederickson's implicit library deserves special mention.[18] Its CUDA kernels can train a 100-factor iALS model on the Last.fm 1B dataset in minutes on a single GPU, an order of magnitude faster than the Spark implementation on a comparable cluster.[17] The library also exposes BPR and Logistic Matrix Factorization, allowing direct comparison between WALS and ranking-based alternatives.
WALS has been applied across virtually every domain where implicit feedback dominates the available signal.
Music streaming services were early adopters. Spotify documented in a 2014 blog series by Erik Bernhardsson that its core collaborative filtering pipeline was built on iALS over user listening histories, with item embeddings then indexed by Annoy for approximate nearest neighbour search.[16] Bernhardsson's open-source Annoy library was originally developed at Spotify partly to serve those WALS embeddings for retrieval. Even after Spotify moved much of its retrieval to two-tower neural models in the late 2010s, iALS remained a strong production baseline and an honest benchmark for new architectures.
The original Hu, Koren, and Volinsky paper used TV viewing data from set-top boxes: implicit signals from a US cable provider with millions of subscribers and tens of thousands of channels and shows.[1] YouTube has published several recommender architectures over the years, including a 2016 paper ("Deep Neural Networks for YouTube Recommendations") that explicitly contrasts its deep retrieval model against matrix factorization baselines of the WALS family.[15]
Large online stores use WALS over implicit purchase, view, and add-to-cart signals. Etsy described its use of an iALS-style model in a 2014 engineering blog post for personalised recommendations on its homepage, and Quora open-sourced its QMF library specifically to scale WRMF to its question-answering platform.[19] The technique is also a standard in the AWS Personalize service's repertoire of off-the-shelf algorithms for implicit feedback.
Click-through rate prediction systems often start with a WALS-style embedding learned over historical clicks, then enrich it with side features through models such as factorization machines, DLRM,[13] or deep & cross networks. Even when downstream rankers are deep, the candidate generation step is frequently a dot-product retrieval over WALS-trained embeddings.
Query-document interaction matrices can be factorized with WALS to produce query and document embeddings used as additional signals in learning-to-rank pipelines. Microsoft, Bing, and Yahoo have all published papers using iALS-style techniques as feature generators or candidate retrievers in personalised search.
Beyond classic recommendations, WALS appears in several scientific domains. Genomics studies use it to recover patient-mutation associations from sparse count matrices. Network analysis applies WALS to citation and collaboration graphs. Educational technology platforms use it to recommend exercises by treating student-question interactions as implicit feedback.
WALS is one of several matrix factorization algorithms. The choice depends on the data type, the loss function, and the engineering constraints.
| Method | Loss | Feedback | Optimization | Strengths | Weaknesses |
|---|---|---|---|---|---|
| Truncated SVD | Frobenius reconstruction | Fully observed dense matrix | Lanczos / randomized SVD | Optimal under squared error; closed form | Cannot handle missing data natively; treats zeros as observations |
| Funk SVD | Squared error on observed | Explicit | SGD | Simple; fast on sparse explicit data | Ignores unobserved pairs; weak on implicit feedback |
| Biased SVD | Squared error with biases | Explicit | SGD or ALS | Captures user/item bias; very strong on Netflix-style data | Same weakness on implicit feedback |
| SVD++ | Squared error with implicit term | Mixed | SGD | Combines explicit ratings with implicit signals | More expensive than basic SVD |
| NMF | KL or Frobenius | Counts (non-negative) | Multiplicative updates / ANLS | Interpretable parts; non-negative factors | No native confidence weighting for implicit |
| WALS / iALS | Weighted squared error over all pairs | Implicit | Alternating closed-form least squares | Industrial-scale implicit feedback; embeds well for ANN | Trains a global model; not personalised online |
| BPR-MF | Pairwise ranking (logistic) | Implicit | SGD over (user, pos, neg) triples | Optimises ranking directly; strong top-K performance | Triplet sampling can be slow; less stable than WALS |
| Logistic Matrix Factorization | Binary cross-entropy weighted | Implicit | SGD or ALS variants | Probabilistic interpretation; calibrated outputs | Often comparable to iALS but more hyperparameters |
| Probabilistic Matrix Factorization (PMF) | Gaussian likelihood + Gaussian prior | Explicit | MAP via SGD | Bayesian interpretation of regularization | Not designed for implicit feedback |
| Bayesian PMF | Same as PMF with hyperpriors | Explicit | MCMC | Automatic hyperparameter selection | Higher computational cost |
| Factorization Machines | Squared / logistic with side features | Either | SGD / ALS / MCMC | Handles arbitrary features | Loses some scalability benefits of pure MF |
| CCD++ | Squared error rank-one updates | Either | Cyclic coordinate descent | Very fast on multi-core CPUs | Less common in modern toolchains |
| Deep Matrix Factorization | Reconstruction with neural transforms | Either | SGD | Captures non-linearities | Often beaten by tuned iALS in benchmarks |
Classical truncated SVD operates on a fully observed dense matrix and assumes uniform error variance. Applying it to an implicit feedback matrix means treating unobserved pairs as hard zeros, which biases the decomposition toward popular items. WALS keeps the dense formulation but downweights unobserved pairs by setting their confidence to 1 while observed pairs get confidence 1 + alpha * r_ui. This single change converts SVD-style decomposition into a method that respects the asymmetry of implicit data.
Funk SVD only computes loss over observed entries, which works well for explicit ratings but discards a key signal in implicit data: the absence of an interaction. WALS includes every pair in the loss with a small but non-zero weight, allowing the model to push apart observed and unobserved items in latent space.
Both WALS and Bayesian Personalized Ranking (BPR) target implicit feedback, but they optimize different objectives.[5] WALS minimizes weighted squared error over all pairs; BPR minimizes a pairwise ranking loss between observed and unobserved items. Empirically, BPR can produce sharper top-K rankings on dense datasets, while WALS is more stable, easier to parallelise, and converges in fewer iterations. The 2022 "Revisiting the Performance of iALS" paper by Rendle, Krichene, Zhang, and Anderson showed that with careful tuning of regularization, sampling, and frequency weighting, iALS matches or exceeds BPR on standard benchmarks (MovieLens, Pinterest, Million Songs).[7]
Non-negative matrix factorization constrains all factor entries to be non-negative, producing parts-based representations that are often interpretable. WALS has no sign constraint and is optimised for prediction quality on implicit data rather than interpretability. NMF can be adapted to implicit feedback (see Hu, Koren, and Volinsky's discussion of weighted NMF),[1] but the WALS formulation is usually faster and easier to implement.
Deep learning and self-supervised methods have produced several alternatives to WALS for implicit feedback recommendation.
| Method | Year | Architecture | Pros vs WALS | Cons vs WALS |
|---|---|---|---|---|
| Neural Collaborative Filtering (NCF) | 2017 | Two-branch MLP combining GMF and MLP | Captures non-linearities; flexible inputs | Slower; Rendle 2020 showed properly tuned WALS often matches it |
| Variational Autoencoder for CF (Mult-VAE) | 2018 | Encoder-decoder with multinomial likelihood | Strong on dense datasets; principled probabilistic loss | Larger memory footprint; harder retrieval |
| Two-tower retrieval | ~2016+ | Separate user and item encoders | Native support for side features | Higher engineering complexity |
| Deep Learning Recommendation Model (DLRM) | 2019 (Meta) | Embedding tables + MLP + feature interactions | Native handling of categorical and numeric features | Designed for ranking, not retrieval |
| Wide & Deep | 2016 (Google) | Linear wide branch + deep MLP branch | Generalisation plus memorisation | Heavier than WALS; needs feature engineering |
| SASRec | 2018 | Self-attention over user history | Captures sequential preferences | Requires user histories; offline retraining |
| BERT4Rec | 2019 | Bidirectional transformer | Stronger sequence modelling | Even higher compute cost |
| GRU4Rec | 2016 | Gated recurrent network over sessions | Good for session-based recommendation | Less of a general retrieval model |
| EASE / EASE^R | 2019 | Item-item linear autoencoder | Closed-form; sometimes beats neural baselines | O(n^2) memory in items |
| LightGCN | 2020 | Linearised graph convolution on user-item graph | Strong on benchmarks; conceptually simple | Graph propagation cost |
A recurring theme in this list is that carefully tuned WALS remains a competitive baseline. Steffen Rendle, Walid Krichene, Li Zhang, and John Anderson's 2020 RecSys paper "Neural Collaborative Filtering vs. Matrix Factorization Revisited" reproduced the original NCF experiments[9] and showed that a properly hyperparameter-tuned dot-product matrix factorization outperforms the MLP-based learned similarity.[6] Their 2022 follow-up, "Revisiting the Performance of iALS," showed the same for WALS specifically: with appropriate frequency-aware regularization, item bias, and the confidence functional form chosen by cross-validation, iALS matches or beats Mult-VAE,[10] EASE,[11] and LightGCN[12] on the MovieLens-20M, MSD, and Yelp benchmarks.[7]
In industry the picture is similar. Most large recommender pipelines combine WALS-style retrieval with a deep ranker. WALS-trained item embeddings feed an approximate nearest neighbour index (FAISS, ScaNN, Annoy) to generate a few hundred candidates per user in real time; a downstream neural model with side features then re-ranks those candidates. The role of WALS has narrowed from end-to-end recommender to fast candidate generator, but it remains an essential ingredient.
The main hyperparameters of WALS are:
| Hyperparameter | Symbol | Typical values | Notes |
|---|---|---|---|
| Latent dimension | f | 32 to 512 | Higher gives more capacity but more memory and compute |
| Regularization | lambda | 1e-3 to 1e1 | Tune by cross-validation; frequency-aware variant scales by user/item interaction count |
| Confidence weight scale | alpha | 1 to 100 | Larger pulls observed pairs harder; the original paper used 40 |
| Confidence functional form | linear or log | Pick by experiment | Log compresses heavy-tailed counts |
| Number of iterations | T | 10 to 30 | Loss usually plateaus quickly |
| Bias term | optional | on / off | Frequently improves accuracy at trivial cost |
Unlike SGD, WALS has no learning rate to tune. The closed-form alternation removes a major source of fragility from the training process, which is one of the reasons it has remained popular over more than fifteen years.
WALS has well-known limitations that drove the development of richer methods.
X and Y grow linearly with the number of users and items. For services with billions of users, hash embedding tricks, quantisation, or feature hashing become necessary.Despite these limitations, WALS continues to be deployed in production at companies such as Spotify, Netflix, YouTube, Etsy, and Quora, often as the candidate generator inside a larger neural ranking system. It is fast to train, easy to tune, robust to noise, and produces embeddings that integrate cleanly with approximate nearest neighbour search.
WALS sits at the intersection of several key ideas in recommendation and representation learning.
Y^T C^u Y = Y^T Y + Y^T (C^u - I) Y is a special case of the Sherman-Morrison-Woodbury identity in numerical linear algebra.