In collaborative filtering and matrix factorization recommender systems, the user matrix (commonly written U or P) is the matrix of latent-factor vectors for users. Each row is one user's embedding in a low-dimensional latent space, and each column is one of the latent factors learned during training. Together with the item matrix V (sometimes written Q), the user matrix is used to approximate the very large, very sparse rating matrix R that records how users have interacted with items.
The user matrix is the user-side half of the basic matrix factorization model that powered most production recommendation system work from the mid-2000s onward, and it remains the conceptual ancestor of the user-side embedding tower in modern neural retrieval architectures.
Let there be N users and M items, and let R be the N by M rating matrix whose entry r_{u,i} is user u's rating of item i (or missing if the user has not rated that item). Matrix factorization picks a small latent dimension k (typically 10 to 200) and learns two matrices: the N by k user matrix U, whose row U_u is user u's latent-factor vector, and the M by k item matrix V, whose row V_i is item i's latent-factor vector.
The rating matrix is approximated by the product
R ≈ U V^T
so the predicted rating for user u on item i is the dot product
r̂_{u,i} = U_u · V_i = Σ_{f=1..k} U_{u,f} V_{i,f}.
A more accurate version adds bias terms:
r̂_{u,i} = μ + b_u + b_i + U_u · V_i
where μ is the global mean rating, b_u is a user bias (some users rate everything high), and b_i is an item bias (some items are simply more popular). Koren, Bell, and Volinsky popularized this baseline-plus-bias formulation in their 2009 IEEE Computer survey, and it remains the default formulation in libraries such as Surprise.
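To make the bookkeeping concrete, here is a minimal NumPy sketch of the biased prediction rule. The matrix shapes, the global mean, and the example indices are illustrative placeholders, not values from any particular dataset.

```python
import numpy as np

n_users, n_items, k = 1000, 500, 32            # illustrative sizes

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(n_users, k))   # user matrix: one k-dim row per user
V = rng.normal(scale=0.1, size=(n_items, k))   # item matrix: one k-dim row per item
b_u = np.zeros(n_users)                        # user biases
b_i = np.zeros(n_items)                        # item biases
mu = 3.6                                       # global mean rating (illustrative)

def predict(u, i):
    """Biased matrix factorization prediction: mu + b_u + b_i + U_u . V_i."""
    return mu + b_u[u] + b_i[i] + U[u] @ V[i]

print(predict(42, 7))   # predicted rating for user 42 on item 7
```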
The latent factors learned in U have no fixed meaning. After training they may loosely correspond to genres, demographics, or quality, but the model is free to use the dimensions however it wants. This is the entire point of model-based collaborative filtering: instead of asking a human to define features, the optimizer discovers a user representation that explains the observed ratings.
Classical Singular Value Decomposition factors a complete matrix into U Σ V^T. Recommender data is almost never complete (a typical user rates a few hundred out of millions of possible items), so true SVD does not apply directly. The breakthrough was to treat the missing entries as missing rather than as zero, and to learn U and V by minimizing squared error only on the observed ratings.
Simon Funk popularized this approach in his December 11, 2006 blog post "Try This At Home," written during the Netflix Prize competition. Funk trained one latent factor at a time using stochastic gradient descent on the residual ratings and shared C code that fit on a few pages. His model, often called "Funk SVD" or simply MF, is not the strict mathematical SVD but inherited the name. Funk's submission climbed to third place on the Netflix leaderboard and inspired many of the techniques that the Netflix Prize winners eventually combined.
Yehuda Koren, Robert Bell, and Chris Volinsky generalized the idea, added bias terms, integrated implicit feedback, and described the full framework in the 2009 article "Matrix Factorization Techniques for Recommender Systems," published in IEEE Computer (volume 42, issue 8, pages 30 to 37). That paper is the standard reference for the user-matrix / item-matrix formulation and became one of the most cited recommender systems publications of the decade.
The Netflix Prize itself ran from October 2006 to September 2009 with a $1,000,000 grand prize for any team that could beat Cinematch's RMSE on a held-out test set by 10 percent. The team "BellKor's Pragmatic Chaos" (which included Bell, Koren, and Volinsky, plus researchers from Pragmatic Theory and Commendo Research) won the prize on September 21, 2009, edging out the tied team "The Ensemble" by submitting its final entry 20 minutes earlier. Their winning model was a blend of dozens of components, but the matrix factorization core (with user and item matrices) was the single most important ingredient.
The user and item matrices are usually trained by minimizing regularized squared error on the observed ratings:
L = Σ_{(u,i)∈observed} (r_{u,i} − μ − b_u − b_i − U_u · V_i)² + λ (‖U_u‖² + ‖V_i‖² + b_u² + b_i²)
The λ term is L2 regularization that keeps factor magnitudes from blowing up on users or items with very few ratings. Three optimization methods dominate.
| method | how it works | strengths | weaknesses |
|---|---|---|---|
| stochastic gradient descent (SGD) | iterate over ratings; for each (u, i, r) take a gradient step on U_u, V_i, b_u, b_i | very fast per step, easy to implement, what Funk used in 2006 | hyperparameter-sensitive; harder to parallelize across many machines |
| alternating least squares (ALS) | hold V fixed, solve a least-squares problem for each row of U; then swap and solve for V | each subproblem is closed-form and embarrassingly parallel; works well on Spark | per-iteration cost is higher; less natural for online updates |
| coordinate descent | update one coordinate of one row at a time | very fast convergence on sparse problems; used in LibFM and LibMF | trickier to implement; not built into mainstream frameworks |
In practice, SGD dominates research code (the Surprise library defaults to SGD), while ALS dominates large-scale industry deployments because the per-iteration step parallelizes cleanly. Hu, Koren, and Volinsky's 2008 ICDM paper introduced an ALS variant for implicit feedback that scales linearly with the number of observed interactions, which is what made it possible to train the user matrix on billions of clicks at companies like Yahoo and AT&T.
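As a rough illustration of the SGD recipe in the table above, here is a minimal NumPy training loop over observed (user, item, rating) triples that minimizes the regularized squared-error loss given earlier. The toy data and hyperparameters are placeholders, not tuned values.

```python
import numpy as np

# toy observed ratings: (user index, item index, rating) -- placeholder data
ratings = [(0, 0, 5.0), (0, 2, 3.0), (1, 1, 4.0), (2, 0, 1.0), (2, 2, 2.0)]

n_users, n_items, k = 3, 3, 8
lr, lam, n_epochs = 0.01, 0.05, 50          # learning rate, L2 strength, passes over the data

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(n_users, k))
V = rng.normal(scale=0.1, size=(n_items, k))
b_u = np.zeros(n_users)
b_i = np.zeros(n_items)
mu = np.mean([r for _, _, r in ratings])

for _ in range(n_epochs):
    for u, i, r in ratings:
        err = r - (mu + b_u[u] + b_i[i] + U[u] @ V[i])   # residual on this rating
        # gradient steps on exactly the parameters this rating touches
        b_u[u] += lr * (err - lam * b_u[u])
        b_i[i] += lr * (err - lam * b_i[i])
        U_u_old = U[u].copy()
        U[u] += lr * (err * V[i] - lam * U[u])
        V[i] += lr * (err * U_u_old - lam * V[i])
```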
The basic user matrix has been extended many times. The most influential variants are listed below.
| variant | year and reference | what it adds to U |
|---|---|---|
| Funk SVD | Funk, 2006 ("Try This At Home" blog) | the original matrix factorization trained with SGD; one latent factor at a time |
| regularized SVD with bias | Paterek, 2007 KDD Cup | global mean plus user and item bias terms added to U_u · V_i |
| SVD++ | Koren, 2008 KDD ("Factorization Meets the Neighborhood") | adds an implicit-feedback term to the user vector based on the set of items the user has interacted with, even without ratings |
| timeSVD++ | Koren, 2009 KDD; CACM 2010 | makes user biases and user latent factors functions of time, capturing drift in taste |
| probabilistic matrix factorization (PMF) | Salakhutdinov and Mnih, NIPS 2007 | gives U and V Gaussian priors and trains by maximum a posteriori; scales linearly and handles sparse Netflix data well |
| Bayesian PMF | Salakhutdinov and Mnih, ICML 2008 | full Bayesian treatment with MCMC sampling over U and V |
| non-negative matrix factorization (NMF) | Lee and Seung, Nature 1999 | constrains both U and V to be non-negative, often producing more interpretable factors |
| implicit ALS | Hu, Koren, Volinsky, ICDM 2008 | redefines the loss as confidence-weighted preference for implicit feedback (clicks, plays, views) instead of explicit ratings |
| factorization machines | Rendle, ICDM 2010 | generalizes the user matrix to handle arbitrary side features (user demographics, device, time of day) within the same factorized framework |
SVD++ is worth singling out because it explicitly modifies the user vector. The user representation is no longer a single learned row of U; it becomes
user vector = U_u + |N(u)|^(−1/2) Σ_{j ∈ N(u)} y_j
where N(u) is the set of items user u has implicitly interacted with and y_j is a per-item implicit-feedback factor. This means the model still has a base user matrix U, but it augments each user's embedding with information from items they have viewed even when they did not give an explicit rating. SVD++ was a major contributor to the BellKor Netflix Prize submission.
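A small sketch of how that augmented user vector could be assembled, assuming the base matrix U, the per-item implicit factors y, and the interaction sets N(u) have already been learned; the arrays below are illustrative placeholders.

```python
import numpy as np

k = 16
n_users, n_items = 100, 400
rng = np.random.default_rng(0)

U = rng.normal(scale=0.1, size=(n_users, k))   # base user matrix
y = rng.normal(scale=0.1, size=(n_items, k))   # per-item implicit-feedback factors
N = {0: [3, 17, 42]}                           # items user 0 has interacted with (placeholder)

def svdpp_user_vector(u):
    """SVD++ user representation: U_u + |N(u)|^(-1/2) * sum of y_j over interacted items j."""
    items = N.get(u, [])
    if not items:
        return U[u]
    return U[u] + y[items].sum(axis=0) / np.sqrt(len(items))

print(svdpp_user_vector(0)[:4])
```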
A pure user matrix has one row per user, learned only from that user's observed ratings. This causes two well-known problems:

- Cold start: a brand-new user has no observed ratings, so their row of U is undefined (or stuck at its random initialization) and the model has nothing useful to rank with.
- No generalization across users: because the row is a pure ID lookup, the model cannot produce an embedding for a user it never saw during training, and adding users normally means retraining or folding them in separately.
The standard mitigations are to seed new users with explicit onboarding questions, to use content-based features (age, country, signup source) to initialize U_u, or to switch to a model like factorization machines that can blend latent factors with side features in the same equation. Two-tower neural models, discussed below, address the cold-start problem more cleanly because the user tower is a function of features instead of a lookup row.
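One hedged illustration of the content-based initialization idea: learn a simple linear map from user side features to the latent space on existing users, then use it to seed U_u for a new user. The feature choice and the least-squares fit here are assumptions made for the sketch, not a prescribed recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, k, n_features = 500, 32, 6

U = rng.normal(scale=0.1, size=(n_users, k))   # trained user matrix
X = rng.normal(size=(n_users, n_features))     # user side features (age bucket, country, signup source, ...)

# fit a linear map W so that X @ W approximates the trained user vectors
W, *_ = np.linalg.lstsq(X, U, rcond=None)

def init_new_user(features):
    """Seed a cold-start user's latent vector from their side features."""
    return features @ W

new_user_features = rng.normal(size=n_features)   # placeholder features for a new signup
U_new = init_new_user(new_user_features)          # usable immediately in mu + b + U_new . V_i
```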
Classical matrix factorization assumes explicit ratings such as 1-to-5 stars. Most production systems do not have explicit ratings; they have implicit feedback signals such as clicks, plays, dwell time, purchases, or completions. Implicit feedback has different statistical properties:

- there is no explicit negative signal (a user who never clicked an item may simply never have seen it),
- the observations are noisy (a play or a purchase does not guarantee the user liked the item), and
- the numerical value (play count, dwell time) reflects confidence in the signal rather than the strength of a preference.
Hu, Koren, and Volinsky's 2008 paper redefined the objective as a confidence-weighted least squares problem over a binary preference matrix and gave an ALS algorithm whose cost scales linearly with the number of nonzero entries. This is the algorithm implemented in Spark's ALS.trainImplicit and in the Python implicit package, and it is one of the most widely deployed user-matrix training procedures in industry.
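A minimal usage sketch with the implicit package; version-dependent details, such as the expected orientation of the interaction matrix and the return type of recommend, are assumptions to check against the installed release.

```python
import numpy as np
import scipy.sparse as sp
import implicit

# toy user-item interaction counts (e.g. play counts); rows = users, columns = items
interactions = sp.csr_matrix(np.array([
    [3, 0, 0, 1],
    [0, 5, 0, 0],
    [1, 0, 2, 0],
], dtype=np.float32))

# confidence-weighted implicit ALS (Hu, Koren, Volinsky 2008)
model = implicit.als.AlternatingLeastSquares(factors=16, regularization=0.01, iterations=15)
model.fit(interactions)            # recent releases expect a user-by-item matrix

U = model.user_factors             # the learned user matrix
V = model.item_factors             # the learned item matrix
ids, scores = model.recommend(0, interactions[0], N=2)   # top-2 items for user 0
```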
Matrix factorization with a learned user matrix has powered or contributed to recommendation systems at Netflix, Amazon, Yahoo, AT&T, LinkedIn, Spotify, and Last.fm. Even when the production model is now a deep neural network, the user matrix usually survives in some form as the user-side embedding table, because the ID-to-vector lookup is the smallest, fastest, and most reliable piece of any recommender.
Netflix dropped the actual Cinematch system in 2012 in favor of personalized ranking models, but Netflix engineers have stated publicly that the matrix factorization techniques developed during the Prize remained part of the production stack for years afterward.
The user matrix is rarely evaluated in isolation. The metrics measure the quality of the predictions or rankings produced by combining U and V.
| metric | what it measures | typical use |
|---|---|---|
| RMSE (root mean squared error) | square root of the mean squared error between predicted and held-out ratings | explicit feedback, Netflix Prize style benchmarks |
| MAE (mean absolute error) | average absolute error | explicit feedback, less sensitive to outliers than RMSE |
| HR@k (hit rate at k) | fraction of users for whom the held-out item appears in the top k | implicit feedback ranking |
| NDCG@k (normalized discounted cumulative gain) | top-k ranking quality with positional discount | implicit feedback ranking, weighted by relevance |
| MAP (mean average precision) | average precision at the ranks of relevant items, averaged over users | implicit feedback ranking |
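For concreteness, a small sketch of two of these metrics on held-out data; the arrays are placeholders and the held-out protocol (one item per user for HR@k) is just one common choice.

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean squared error over held-out explicit ratings."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

def hit_rate_at_k(ranked_items_per_user, held_out_item_per_user, k=10):
    """Fraction of users whose single held-out item appears in their top-k list."""
    hits = sum(
        held_out in ranked[:k]
        for ranked, held_out in zip(ranked_items_per_user, held_out_item_per_user)
    )
    return hits / len(held_out_item_per_user)

print(rmse([3.5, 4.0, 2.0], [4.0, 4.0, 1.0]))
print(hit_rate_at_k([[7, 2, 9], [1, 5, 3]], [2, 4], k=3))   # user 0 hits, user 1 misses -> 0.5
```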
| library | language | notes |
|---|---|---|
| Surprise | Python | textbook implementations of SVD, SVD++, NMF, KNN baselines; SGD-based |
| Spark MLlib ALS | Scala / Python / Java | distributed ALS for both explicit and implicit feedback (implicitPrefs=True) |
| implicit (Ben Frederickson) | Python with C extensions | fast ALS, BPR, and logistic MF for implicit feedback datasets |
| LibFM | C++ | factorization machines by Steffen Rendle; supports SGD, ALS, MCMC |
| LibMF | C++ | high-performance parallel SGD for matrix factorization |
| TensorFlow Recommenders | Python | modern factorization models, retrieval, and two-tower architectures |
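As one concrete entry point, a short Surprise sketch that trains a biased SVD model and reads off the learned user matrix; the dataset and hyperparameters are illustrative (the fitted model exposes the user factors as its pu attribute and the item factors as qi).

```python
from surprise import SVD, Dataset

data = Dataset.load_builtin("ml-100k")     # downloads MovieLens 100k on first use
trainset = data.build_full_trainset()

algo = SVD(n_factors=50, n_epochs=20, lr_all=0.005, reg_all=0.02)
algo.fit(trainset)

U = algo.pu        # user matrix: one row per inner user id
V = algo.qi        # item matrix
print(U.shape)     # (number of users, 50)

# predict a single rating; ids are the dataset's raw ids (strings in ml-100k)
print(algo.predict(uid="196", iid="302").est)
```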
Deep learning displaced classical matrix factorization in many production systems during the late 2010s, but the user matrix concept did not disappear. It evolved into the user tower of two-tower retrieval models.
| approach | how the user representation is built | where it is used |
|---|---|---|
| classical MF | one trained row of U per user (a lookup) | Netflix Prize, Surprise, Spark ALS, most baselines |
| Neural Collaborative Filtering (NCF) | user embedding fed through an MLP that also takes the item embedding | He et al., WWW 2017 |
| Wide and Deep | embedding for low-dimensional latent factors plus wide cross features | Cheng et al., Google, RecSys 2016 |
| DLRM | sparse user embeddings combined with dense features and feature crosses | Naumov et al., Facebook, 2019 |
| two-tower model | user tower is a neural network mapping user features and history to an embedding; item tower does the same | YouTube, TikTok, Spotify, Pinterest, Google retrieval |
| Graph neural recommenders (PinSage, LightGCN) | user node embeddings are aggregated from the user-item interaction graph | Pinterest, Alibaba |
| LLM-based recommenders | user representation is a prompt or learned soft prompt; item retrieval is by language modeling | research and early production at TikTok, Meta, Amazon |
The two-tower architecture is the most direct descendant of classical matrix factorization. The user tower replaces the static row of U with a function of user features (history, demographics, recent queries) so that new users can be represented immediately. The dot product between the user-tower output and the item-tower output is the same scoring function as U_u · V_i, and the resulting embeddings are still indexed by approximate nearest neighbor search. In that sense, the user matrix has not been replaced; it has been generalized.
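To make that correspondence explicit, here is a hedged PyTorch-style sketch of a tiny two-tower scorer. The feature choices, layer sizes, and the use of a mean-pooled history are invented for illustration; production towers are far richer and the training loop is omitted.

```python
import torch
import torch.nn as nn

class UserTower(nn.Module):
    """Maps user features (an id embedding plus a pooled history) to a k-dim embedding."""
    def __init__(self, n_users, n_items, k=32):
        super().__init__()
        self.user_id_emb = nn.Embedding(n_users, k)
        self.item_emb = nn.Embedding(n_items, k)      # embeds the user's click/watch history
        self.mlp = nn.Sequential(nn.Linear(2 * k, 64), nn.ReLU(), nn.Linear(64, k))

    def forward(self, user_ids, history_item_ids):
        history = self.item_emb(history_item_ids).mean(dim=1)   # average of history item embeddings
        x = torch.cat([self.user_id_emb(user_ids), history], dim=-1)
        return self.mlp(x)

class ItemTower(nn.Module):
    def __init__(self, n_items, k=32):
        super().__init__()
        self.item_id_emb = nn.Embedding(n_items, k)
        self.mlp = nn.Sequential(nn.Linear(k, 64), nn.ReLU(), nn.Linear(64, k))

    def forward(self, item_ids):
        return self.mlp(self.item_id_emb(item_ids))

# scoring is still a dot product, exactly like U_u . V_i in classical MF
user_tower, item_tower = UserTower(1000, 500), ItemTower(500)
u = user_tower(torch.tensor([3]), torch.tensor([[7, 42, 10]]))
v = item_tower(torch.tensor([99]))
score = (u * v).sum(dim=-1)
```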
The user matrix in its classical form has well-known weaknesses:

- it cannot represent a user who was not in the training data (the cold-start problem described above),
- it has no way to use side features such as demographics, device, or time of day without moving to an extension like factorization machines,
- the dot-product scoring captures only linear interactions between user and item factors, and
- drift in taste and session context are invisible unless the model is extended along the lines of timeSVD++ or sequence-aware neural models.
Despite these limitations, the user matrix remains the conceptual baseline for almost every academic recommender system paper, the de facto starting point for industry teams building a new recommender, and a building block inside larger neural systems. When practitioners say "learn user embeddings," they almost always mean some descendant of U.
Imagine a giant grid where every row is a person at a candy party and every column is a different type of candy. Most squares are empty because most people have not tried most candies. The user matrix is a much smaller grid that gives each person a short list of secret numbers describing what kinds of candy they probably like (sweet, sour, chocolatey, chewy, and so on). The candy gets its own short list of secret numbers too. To guess whether someone will like a candy, you multiply their numbers with the candy's numbers and add them up. The bigger the answer, the better the match. The user matrix is the part of the system that holds everyone's secret-numbers card.