# User matrix

> Source: https://aiwiki.ai/wiki/user_matrix
> Updated: 2026-06-24
> Categories: Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

In [collaborative filtering](/wiki/collaborative_filtering) and [matrix factorization](/wiki/matrix_factorization) recommender systems, the **user matrix** (commonly written U or P) is the matrix of latent-factor vectors for users: each row is one user's [embedding](/wiki/embedding) in a low-dimensional latent space, and each column is one of the latent factors learned during training. Together with the [item matrix](/wiki/item_matrix) V (sometimes written Q), the user matrix approximates the very large, very sparse [rating matrix](/wiki/rating_matrix) R through the factorization R ≈ U V^T, so a predicted rating is the dot product of a user row and an item row. The number of latent factors k is small, typically 10 to 200, even when the rating matrix has millions of users and items [6].

The user matrix is the user-side half of the basic matrix factorization model that powered most production [recommendation system](/wiki/recommender_system) work from the mid-2000s onward, and it remains the conceptual ancestor of the user-side embedding tower in modern neural retrieval architectures. As Koren, Bell, and Volinsky put it in their 2009 IEEE Computer survey, "matrix factorization models are superior to classic nearest-neighbor techniques for producing product recommendations, allowing the incorporation of additional information such as implicit feedback, temporal effects, and confidence levels" [6].

## What is the user matrix? (definition and notation)

Let there be N users and M items, and let R be the N by M rating matrix whose entry r_{u,i} is user u's rating of item i (the entry is missing, and often stored as 0, when the user has not rated that item). Matrix factorization picks a small latent dimension k (typically 10 to 200) and learns two matrices [6]:

- The user matrix U with shape N by k. Row U_u is the latent vector for user u.
- The item matrix V with shape M by k. Row V_i is the latent vector for item i.

The rating matrix is approximated by the product

    R ≈ U V^T

so the predicted rating for user u on item i is the dot product

    r̂_{u,i} = U_u · V_i = Σ_{f=1..k} U_{u,f} V_{i,f}.

A more accurate version adds bias terms:

    r̂_{u,i} = μ + b_u + b_i + U_u · V_i

where μ is the global mean rating, b_u is a user bias (some users rate everything high), and b_i is an item bias (some items are simply more popular). Koren, Bell, and Volinsky described this baseline-plus-bias formulation in their 2009 IEEE Computer survey, and it is still the default in libraries such as Surprise and Spark MLlib [6][12][13].
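The two prediction formulas above can be sketched in a few lines of NumPy. This is a minimal illustration, not any library's implementation: the toy sizes, the random values, and the names `b_user`, `b_item`, `predict`, and `predict_all` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 5, 3                   # toy sizes, assumed for illustration

U = rng.normal(scale=0.1, size=(n_users, k))    # user matrix: one row per user
V = rng.normal(scale=0.1, size=(n_items, k))    # item matrix: one row per item
mu = 3.5                                        # global mean rating (assumed value)
b_user = rng.normal(scale=0.1, size=n_users)    # per-user bias b_u
b_item = rng.normal(scale=0.1, size=n_items)    # per-item bias b_i


def predict(u, i):
    """Biased prediction for one pair: mu + b_u + b_i + U_u . V_i."""
    return mu + b_user[u] + b_item[i] + U[u] @ V[i]


def predict_all():
    """All N x M predictions at once: mu + b_u + b_i + (U V^T)[u, i]."""
    return mu + b_user[:, None] + b_item[None, :] + U @ V.T


R_hat = predict_all()   # dense N x M matrix of predicted ratings
```

The full matrix `R_hat` is only materialized here because the example is tiny; with millions of users and items, predictions are normally computed per user or per candidate set.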

The latent factors learned in U have no fixed meaning. After training they may loosely correspond to genres, demographics, or quality, but the model is free to use the dimensions however it wants. This is the entire point of model-based collaborative filtering: instead of asking a human to define features, the optimizer discovers a user representation that explains the observed ratings [6].

## Where did the user matrix come from? (origin)

Classical [Singular Value Decomposition](/wiki/singular_value_decomposition) factors a complete matrix into U Σ V^T. Recommender data is almost never complete (a typical user rates at most a few hundred out of millions of possible items), so true SVD does not apply directly. The breakthrough was to treat the missing entries as missing rather than as zero, and to learn U and V by minimizing squared error only on the observed ratings.

Simon Funk popularized this approach in his blog post "Netflix Update: Try This At Home," published December 11, 2006 during the [Netflix Prize](/wiki/netflix_prize) competition [1]. Funk trained one latent factor at a time using stochastic gradient descent on the residual ratings and shared C code that fit on a few pages. He summarized the optimization in one line, "Just take the derivative of the approximation error and follow it," and noted that learning on observed ratings only means "we can choose to simply ignore the unknown error on the 8.4B empty slots" of the Netflix matrix [1]. His model, often called "Funk SVD" or simply MF, is not an SVD in the strict mathematical sense (it has no orthogonality constraints and no singular values), but the name stuck. Funk's submission reached third place on the Netflix leaderboard and inspired most of the techniques that the Netflix Prize winners eventually combined [1].

Yehuda Koren, Robert Bell, and Chris Volinsky generalized the idea, added bias terms, integrated implicit feedback, and described the full framework in the 2009 paper "Matrix Factorization Techniques for Recommender Systems" published in IEEE Computer (volume 42, pages 30 to 37) [6]. That paper is the standard reference for the user-matrix / item-matrix formulation and is among the most cited recommender systems publications of its decade.

The Netflix Prize itself ran from October 2, 2006 to July 26, 2009, with a $1,000,000 grand prize for any team that could beat Cinematch's RMSE on a held-out test set by 10 percent [11]. Cinematch scored a test RMSE of 0.9525, so the target was roughly 0.8572. The team "BellKor's Pragmatic Chaos" (which included Bell, Koren, and Volinsky, plus researchers from Pragmatic Theory and BigChaos) achieved a 10.06 percent improvement with a test RMSE of 0.8567 and was awarded the prize on September 21, 2009, beating a tying team "The Ensemble" by 20 minutes on the final submission [10][11]. Their winning model was a blend of dozens of components, but the matrix factorization core (with user and item matrices) was the single most important ingredient [10].

## How is the user matrix trained?

The user and item matrices are usually trained by minimizing regularized squared error on the observed ratings [6]:

    L = Σ_{(u,i)∈observed} [ (r_{u,i} − μ − b_u − b_i − U_u · V_i)² + λ (‖U_u‖² + ‖V_i‖² + b_u² + b_i²) ]

The λ term is L2 regularization that keeps factor magnitudes from blowing up on users or items with very few ratings. Three optimization methods dominate.

| method | how it works | strengths | weaknesses |
| --- | --- | --- | --- |
| stochastic gradient descent ([SGD](/wiki/stochastic_gradient_descent_sgd)) | iterate over ratings; for each (u, i, r) take a gradient step on U_u, V_i, b_u, b_i | very fast per step, easy to implement, what Funk used in 2006 | hyperparameter-sensitive; harder to parallelize across many machines |
| alternating least squares ([ALS](/wiki/als)) | hold V fixed, solve a least-squares problem for each row of U; then swap and solve for V | each subproblem is closed-form and embarrassingly parallel; works well on Spark | per-iteration cost is higher; less natural for online updates |
| coordinate descent | update one coordinate of one row at a time | very fast convergence on sparse problems; the ALS solver in LibFM is a coordinate-descent method | trickier to implement; not built into mainstream frameworks |

In practice, SGD dominates research code (the Surprise library defaults to SGD), while ALS dominates large-scale industry deployments because the per-iteration step parallelizes cleanly [13][12]. Hu, Koren, and Volinsky's 2008 ICDM paper introduced an ALS variant for implicit feedback whose cost scales linearly with the number of observed interactions, which is what made it possible to train the user matrix on billions of clicks at companies like Yahoo and AT&T [4]. That paper later won the 2017 IEEE ICDM 10-Year Highest-Impact Paper Award [4].
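The SGD row of the table can be sketched as follows. This is a minimal illustration under stated assumptions, not Funk's original code or the Surprise implementation: it updates all k factors together rather than one factor at a time, absorbs the factor of 2 from the gradient into the learning rate, and uses hypothetical toy ratings and hyperparameter values.

```python
import numpy as np


def train_sgd(ratings, n_users, n_items, k=2, lr=0.01, lam=0.02, epochs=200, seed=0):
    """Fit mu, b_u, b_i, U, V by SGD on observed (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_items, k))
    b_u = np.zeros(n_users)
    b_i = np.zeros(n_items)
    mu = float(np.mean([r for _, _, r in ratings]))
    order = np.arange(len(ratings))
    for _ in range(epochs):
        rng.shuffle(order)                      # visit ratings in random order
        for idx in order:
            u, i, r = ratings[idx]
            err = r - (mu + b_u[u] + b_i[i] + U[u] @ V[i])
            # Gradient step on the regularized squared error for this rating.
            b_u[u] += lr * (err - lam * b_u[u])
            b_i[i] += lr * (err - lam * b_i[i])
            user_row = U[u].copy()              # use the pre-update row for V's step
            U[u] += lr * (err * V[i] - lam * U[u])
            V[i] += lr * (err * user_row - lam * V[i])
    return mu, b_u, b_i, U, V


def train_rmse(ratings, mu, b_u, b_i, U, V):
    """Root mean squared error on the given ratings."""
    errs = [r - (mu + b_u[u] + b_i[i] + U[u] @ V[i]) for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errs))))


# Hypothetical toy data: (user, item, rating) triples for 3 users and 4 items.
ratings = [(0, 0, 5), (0, 1, 4), (0, 2, 5), (1, 0, 3),
           (1, 3, 2), (2, 1, 1), (2, 2, 2), (2, 3, 1)]
mu, b_u, b_i, U, V = train_sgd(ratings, n_users=3, n_items=4)
print("training RMSE:", train_rmse(ratings, mu, b_u, b_i, U, V))
```

Only observed triples are visited, so the cost of one epoch is proportional to the number of ratings rather than to N times M.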

## What are the main variants of the user matrix?

The basic user matrix has been extended many times. The most influential variants are listed below.

| variant | year and reference | what it adds to U |
| --- | --- | --- |
| Funk SVD | Funk, 2006 ("Try This At Home" blog) | the original regularized matrix factorization trained with SGD on observed ratings only; one latent factor at a time |
| regularized SVD with bias | Paterek, 2007 KDD Cup | global mean plus user and item bias terms added to U_u · V_i |
| SVD++ | Koren, 2008 KDD ("Factorization Meets the Neighborhood") | adds an implicit-feedback term to the user vector based on the set of items the user has interacted with, even without ratings |
| timeSVD++ | Koren, 2009 KDD; CACM 2010 | makes user biases and user latent factors functions of time, capturing drift in taste |
| probabilistic matrix factorization (PMF) | Salakhutdinov and Mnih, NIPS 2007 | gives U and V Gaussian priors and trains by maximum a posteriori; scales linearly and handles sparse Netflix data well |
| Bayesian PMF | Salakhutdinov and Mnih, ICML 2008 | full Bayesian treatment with MCMC sampling over U and V |
| non-negative matrix factorization ([NMF](/wiki/nmf)) | Lee and Seung, Nature 1999 | constrains both U and V to be non-negative, often producing more interpretable factors |
| implicit ALS | Hu, Koren, Volinsky, ICDM 2008 | redefines the loss as confidence-weighted preference for implicit feedback (clicks, plays, views) instead of explicit ratings |
| factorization machines | Rendle, ICDM 2010 | generalizes the user matrix to handle arbitrary side features (user demographics, device, time of day) within the same factorized framework |

SVD++ is worth singling out because it explicitly modifies the user vector. The user representation is no longer a single learned row of U; it becomes [5]

    user vector = U_u + |N(u)|^(−1/2) Σ_{j ∈ N(u)} y_j

where N(u) is the set of items user u has implicitly interacted with and y_j is a per-item implicit-feedback factor. This means the model still has a base user matrix U, but it augments each user's embedding with information from items they have viewed even when they did not give an explicit rating. SVD++ was a major contributor to the BellKor Netflix Prize submission [5][10].
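The SVD++ user vector can be computed directly from the formula above. The sketch below is illustrative only: the matrix `Y` of implicit item factors, the interaction sets in `N`, and the function name `svdpp_user_vector` are hypothetical, and the values are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 3, 6, 4
U = rng.normal(scale=0.1, size=(n_users, k))   # base user matrix
Y = rng.normal(scale=0.1, size=(n_items, k))   # implicit-feedback factors y_j, one row per item

# N(u): items each user has interacted with (hypothetical toy data).
N = {0: [1, 4, 5], 1: [], 2: [0]}


def svdpp_user_vector(u):
    """Return U_u + |N(u)|^(-1/2) * sum of y_j over j in N(u)."""
    items = N[u]
    if not items:                  # no implicit feedback: fall back to the base row
        return U[u].copy()
    return U[u] + Y[items].sum(axis=0) / np.sqrt(len(items))


vectors = np.vstack([svdpp_user_vector(u) for u in range(n_users)])
```

The predicted rating then uses this augmented vector in place of U_u in the dot product with V_i.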

## What are the cold-start and sparsity problems?

A pure user matrix has one row per user, learned only from that user's observed ratings. This causes two well-known problems:

- A new user has no observed ratings, so there is no signal to fit U_u. The model has to fall back to the global mean plus item bias, which produces the same recommendation for everyone. This is the user-side **[cold start](/wiki/cold_start)** problem.
- Even existing users may have very few ratings, so the learned row is dominated by the L2 regularizer. The model effectively shrinks the user toward the average user.
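Both effects can be seen in a small experiment. The sketch below holds the item matrix fixed and solves the regularized least-squares problem for a single user row in closed form (the same per-row solve that ALS performs). The "true" taste vector, the noise-free ratings, and the value of λ are assumptions chosen for illustration.

```python
import numpy as np


def solve_user_row(V_rated, residuals, lam):
    """Ridge solution for one user row: (V^T V + lam I)^-1 V^T r over the rated items."""
    k = V_rated.shape[1]
    A = V_rated.T @ V_rated + lam * np.eye(k)
    b = V_rated.T @ residuals
    return np.linalg.solve(A, b)


rng = np.random.default_rng(0)
k, lam = 3, 2.0
true_taste = np.array([1.0, -0.5, 0.8])   # hypothetical "true" user vector
V = rng.normal(size=(200, k))             # fixed item matrix (random, for illustration)

norms = {}
for n_rated in (0, 1, 50):
    V_rated = V[:n_rated]                 # items this user has rated
    residuals = V_rated @ true_taste      # noise-free ratings after removing mu and biases
    row = solve_user_row(V_rated, residuals, lam)
    norms[n_rated] = float(np.linalg.norm(row))
    print(n_rated, "ratings -> |U_u| =", norms[n_rated])
```

With zero ratings the solution is exactly the zero vector, so the prediction reduces to the global mean plus item bias. With one rating the row is heavily shrunk, and with fifty it approaches the norm of the assumed true vector.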

The standard mitigations are to seed new users with explicit onboarding questions, to use content-based features (age, country, signup source) to initialize U_u, or to switch to a model like factorization machines that can blend latent factors with side features in the same equation [8]. Two-tower neural models, discussed below, handle cold start more cleanly because the user tower is a function of features instead of a lookup row.

## How does implicit feedback differ from explicit feedback?

Classical matrix factorization assumes explicit ratings such as 1-to-5 stars. Most production systems do not have explicit ratings; they have **[implicit feedback](/wiki/implicit_feedback)** signals such as clicks, plays, dwell time, purchases, or completions. Implicit feedback has different statistical properties [4]:

- There is no negative signal. A user who never clicked an item may simply not have seen it, rather than having disliked it.
- The values are confidence levels, not preferences. Watching a movie 10 times is stronger evidence than watching it once, but both are positive.

Hu, Koren, and Volinsky's 2008 paper redefined the objective as a confidence-weighted least squares problem over a binary preference matrix and gave an ALS algorithm whose cost scales linearly with the number of nonzero entries [4]. This is the algorithm implemented in Spark's `ALS.trainImplicit` and in the Python `implicit` package, and it is by far the most widely deployed user-matrix training procedure in industry [12].
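The user-side half-step of this algorithm can be sketched as follows, using binary preference p = 1 for any observed interaction and confidence c = 1 + α·r. This is a readable dense-array sketch, not the code in Spark or the `implicit` package: the function name, the toy count matrix, and the values of `alpha` and `lam` are assumptions. The linear scaling comes from computing V^T V once and adding only the correction for each user's observed items.

```python
import numpy as np


def implicit_als_user_step(R, V, alpha=40.0, lam=0.1):
    """Recompute every user row with the item matrix V held fixed.

    R is a dense N x M array of raw interaction counts (0 = no interaction).
    """
    n_users, k = R.shape[0], V.shape[1]
    VtV = V.T @ V                                # shared by all users, computed once
    U = np.zeros((n_users, k))
    for u in range(n_users):
        items = np.nonzero(R[u])[0]              # only items the user interacted with
        V_u = V[items]
        conf_minus_1 = alpha * R[u, items]       # c_ui - 1 on the observed items
        # V^T C_u V = V^T V + V_u^T diag(c - 1) V_u
        A = VtV + (V_u.T * conf_minus_1) @ V_u + lam * np.eye(k)
        # V^T C_u p_u, where p_ui = 1 on observed items and 0 elsewhere
        b = V_u.T @ (1.0 + conf_minus_1)
        U[u] = np.linalg.solve(A, b)
    return U


# Hypothetical toy counts: 3 users, 4 items; the last user has no interactions.
R = np.array([[3, 0, 0, 1],
              [0, 2, 0, 0],
              [0, 0, 0, 0]], dtype=float)
rng = np.random.default_rng(0)
V0 = rng.normal(scale=0.1, size=(4, 2))          # initial item matrix

U = implicit_als_user_step(R, V0)                # update users with items fixed
V = implicit_als_user_step(R.T, U)               # same routine updates items by symmetry
```

A full training run alternates these two calls for a fixed number of iterations. Production implementations store R in a sparse format instead of a dense array.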

## Where is the user matrix used in production?

Matrix factorization with a learned user matrix has powered or contributed to recommendation systems at Netflix, Amazon, Yahoo, AT&T, LinkedIn, Spotify, and Last.fm [6][10]. Even when the production model is now a deep neural network, the user matrix usually survives in some form as the user-side embedding table, because the ID-to-vector lookup is the simplest and fastest piece of any recommender.

Netflix dropped the actual Cinematch system in 2012 in favor of personalized ranking models, but Netflix engineers have stated publicly that the matrix factorization techniques developed during the Prize remained part of the production stack for years afterward.

## How is the user matrix evaluated?

The user matrix is rarely evaluated in isolation. The metrics measure the quality of the predictions or rankings produced by combining U and V.

| metric | what it measures | typical use |
| --- | --- | --- |
| RMSE (root mean squared error) | square root of the mean squared error between predicted and held-out ratings | explicit feedback, Netflix Prize style benchmarks |
| MAE (mean absolute error) | average absolute error | explicit feedback, less sensitive to outliers than RMSE |
| HR@k (hit rate at k) | fraction of users for whom the held-out item appears in the top k | implicit feedback ranking |
| NDCG@k (normalized discounted cumulative gain) | top-k ranking quality with positional discount | implicit feedback ranking, weighted by relevance |
| MAP (mean average precision) | mean over users of average precision, i.e. precision at the rank of each relevant item, averaged over that user's relevant items | implicit feedback ranking |
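Several of the metrics in the table are short enough to write out. The functions below are a minimal sketch with assumed names and input conventions (a ranked list of item IDs per user, one held-out item per user for hit rate, and a dict of graded relevance for NDCG). Evaluation libraries differ in details such as tie handling.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between held-out and predicted ratings."""
    d = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(d ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error between held-out and predicted ratings."""
    d = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(d)))


def hit_rate_at_k(ranked_lists, held_out, k):
    """Fraction of users whose held-out item appears in their top-k list."""
    hits = [item in ranked[:k] for ranked, item in zip(ranked_lists, held_out)]
    return sum(hits) / len(hits)


def ndcg_at_k(ranked, relevance, k):
    """NDCG@k for one user; `relevance` maps item -> gain, missing items count as 0."""
    gains = [relevance.get(item, 0.0) for item in ranked[:k]]
    dcg = sum(g / np.log2(pos + 2) for pos, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / np.log2(pos + 2) for pos, g in enumerate(ideal))
    return float(dcg / idcg) if idcg > 0 else 0.0
```

For example, `hit_rate_at_k([[1, 2, 3], [4, 5, 6]], [2, 9], k=2)` counts a hit for the first user only, and `ndcg_at_k` is averaged over users to report a single number.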

## What libraries implement the user matrix?

| library | language | notes |
| --- | --- | --- |
| Surprise | Python | textbook implementations of SVD, SVD++, NMF, KNN baselines; SGD-based |
| Spark MLlib `ALS` | Scala / Python / Java | distributed ALS for both explicit and implicit feedback (`implicitPrefs=True`) |
| `implicit` (Ben Frederickson) | Python with C extensions | fast ALS, BPR, and logistic MF for implicit feedback datasets |
| LibFM | C++ | factorization machines by Steffen Rendle; supports SGD, ALS, MCMC |
| LibMF | C++ | high-performance parallel SGD for matrix factorization |
| TensorFlow Recommenders | Python | modern factorization models, retrieval, and two-tower architectures |

## How does the user matrix relate to modern deep learning?

Deep learning displaced classical matrix factorization in many production systems during the late 2010s, but the user matrix concept did not disappear. It evolved into the **user tower** of two-tower retrieval models.

| approach | how the user representation is built | where it is used |
| --- | --- | --- |
| classical MF | one trained row of U per user (a lookup) | Netflix Prize, Surprise, Spark ALS, most baselines |
| Neural Collaborative Filtering (NCF) | user embedding fed through an MLP that also takes the item embedding | He et al., WWW 2017 |
| Wide and Deep | learned embeddings of sparse features fed to a deep network, combined with a wide linear model over cross-product features | Cheng et al., Google, RecSys 2016 |
| DLRM | sparse user embeddings combined with dense features and feature crosses | Naumov et al., Facebook, 2019 |
| [two-tower model](/wiki/two-tower_model) | user tower is a neural network mapping user features and history to an embedding; item tower does the same | YouTube, TikTok, Spotify, Pinterest, Google retrieval |
| Graph neural recommenders (PinSage, LightGCN) | user node embeddings are aggregated from the user-item interaction graph | Pinterest, Alibaba |
| LLM-based recommenders | user representation is a prompt or learned soft prompt; item retrieval is by language modeling | research and early production at TikTok, Meta, Amazon |

The two-tower architecture is the most direct descendant of classical matrix factorization [9]. The user tower replaces the static row of U with a function of user features (history, demographics, recent queries) so that new users can be represented immediately. The dot product between the user-tower output and the item-tower output is the same scoring function as U_u · V_i, and the resulting embeddings are still indexed by approximate nearest neighbor search. In that sense, the user matrix has not been replaced; it has been generalized.
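The relationship can be made concrete with a small sketch. A linear layer applied to a one-hot user ID is exactly a row lookup in U, while a feature-based tower can score a user who has no row at all. The tower below is a hypothetical one-hidden-layer network with random, untrained weights, and exact sorting stands in for approximate nearest neighbor search. It illustrates the shape of the computation, not any production architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k, n_feat = 4, 6, 3, 5
U = rng.normal(size=(n_users, k))          # classical user matrix
V = rng.normal(size=(n_items, k))          # item embeddings (item-tower outputs)

# Classical MF viewed as a "tower": one-hot ID times U is a row lookup.
one_hot = np.eye(n_users)[2]
lookup = one_hot @ U                       # identical to U[2]

# Feature-based user tower (hypothetical weights, not trained).
W1 = rng.normal(size=(n_feat, 8))
W2 = rng.normal(size=(8, k))


def user_tower(features):
    """Map a user feature vector to a k-dimensional embedding (ReLU MLP)."""
    return np.maximum(features @ W1, 0.0) @ W2


new_user_features = rng.normal(size=n_feat)   # a user with no row in U
scores = V @ user_tower(new_user_features)    # same dot-product scoring as U_u . V_i
top_items = np.argsort(-scores)[:3]           # exact top-3; ANN search replaces this at scale
```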

## What are the limitations of the user matrix?

The user matrix in its classical form has well-known weaknesses:

- It cannot incorporate side features. The user is just an opaque row; demographics, device, and context are ignored unless extensions like factorization machines are used [8].
- It does not handle cold start. New users must be bootstrapped from content features or onboarding flows.
- It is static. New behavior does not change U_u until the model is retrained, which is why timeSVD++ added explicit time dependence [7].
- It captures only linear interactions through the dot product. NCF, two-tower MLPs, and graph models all learn richer interaction functions [9].
- It produces uninterpretable factors. Unlike NMF, the dimensions of standard SVD-style U have no guarantee of corresponding to meaningful concepts [2].

Despite these limitations, the user matrix remains the conceptual baseline for almost every academic recommender system paper, the de facto starting point for industry teams building a new recommender, and a building block inside larger neural systems. When practitioners say "learn user embeddings," they almost always mean some descendant of U.

## Explain like I'm 5

Imagine a giant grid where every row is a person at a candy party and every column is a different type of candy. Most squares are empty because most people have not tried most candies. The user matrix is a much smaller grid that gives each person a short list of secret numbers describing what kinds of candy they probably like (sweet, sour, chocolatey, chewy, and so on). The candy gets its own short list of secret numbers too. To guess whether someone will like a candy, you multiply their numbers with the candy's numbers and add them up. The bigger the answer, the better the match. The user matrix is the part of the system that holds everyone's secret-numbers card.

## References

1. Funk, S. (2006). *Netflix Update: Try This At Home*. December 11, 2006. https://sifter.org/simon/journal/20061211.html
2. Lee, D. D., and Seung, H. S. (1999). *Learning the parts of objects by non-negative matrix factorization*. Nature, 401, 788 to 791. https://www.nature.com/articles/44565
3. Salakhutdinov, R., and Mnih, A. (2008). *Probabilistic Matrix Factorization*. Advances in Neural Information Processing Systems 20 (NIPS 2007). https://papers.nips.cc/paper/3208-probabilistic-matrix-factorization
4. Hu, Y., Koren, Y., and Volinsky, C. (2008). *Collaborative Filtering for Implicit Feedback Datasets*. Proceedings of the 2008 IEEE International Conference on Data Mining (ICDM), 263 to 272. http://yifanhu.net/PUB/cf.pdf
5. Koren, Y. (2008). *Factorization Meets the Neighborhood: A Multifaceted Collaborative Filtering Model*. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 426 to 434.
6. Koren, Y., Bell, R., and Volinsky, C. (2009). *Matrix Factorization Techniques for Recommender Systems*. IEEE Computer, 42, 30 to 37. https://datajobs.com/data-science-repo/Recommender-Systems-%5BNetflix%5D.pdf
7. Koren, Y. (2010). *Collaborative Filtering with Temporal Dynamics*. Communications of the ACM, 53(4), 89 to 97. https://cacm.acm.org/research/collaborative-filtering-with-temporal-dynamics/
8. Rendle, S. (2010). *Factorization Machines*. Proceedings of the 10th IEEE International Conference on Data Mining (ICDM), 995 to 1000. https://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf
9. He, X., Liao, L., Zhang, H., Nie, L., Hu, X., and Chua, T. S. (2017). *Neural Collaborative Filtering*. Proceedings of the 26th International Conference on World Wide Web, 173 to 182. https://arxiv.org/abs/1708.05031
10. Koren, Y. (2009). *The BellKor Solution to the Netflix Grand Prize*. https://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf
11. *Netflix Prize*. Wikipedia. https://en.wikipedia.org/wiki/Netflix_Prize
12. Apache Spark documentation. *Collaborative Filtering (MLlib ALS)*. https://spark.apache.org/docs/latest/ml-collaborative-filtering.html
13. Hug, N. (2020). *Surprise: A Python library for recommender systems*. Journal of Open Source Software, 5(52), 2174. https://surprise.readthedocs.io/

