In collaborative filtering and matrix factorization recommender systems, the user matrix (commonly written U or P) is the matrix of latent-factor vectors for users. Each row is one user's embedding in a low-dimensional latent space, and each column is one of the latent factors learned during training. Together with the item matrix V (sometimes written Q), the user matrix is used to approximate the very large, very sparse rating matrix R that records how users have interacted with items.
The user matrix is the user-side half of the basic matrix factorization model that powered most production recommendation system work from the mid-2000s onward, and it remains the conceptual ancestor of the user-side embedding tower in modern neural retrieval architectures.
Let there be N users and M items, and let R be the N by M rating matrix whose entry r_{u,i} is user u's rating of item i (or missing if the user has not rated that item). Matrix factorization picks a small latent dimension k (typically 10 to 200) and learns two matrices: the N by k user matrix U, whose row U_u is user u's latent-factor vector, and the M by k item matrix V, whose row V_i is item i's latent-factor vector.
The rating matrix is approximated by the product
R ≈ U V^T
so the predicted rating for user u on item i is the dot product
r̂_{u,i} = U_u · V_i = Σ_{f=1..k} U_{u,f} V_{i,f}.
A more accurate version adds bias terms:
r̂_{u,i} = μ + b_u + b_i + U_u · V_i
where μ is the global mean rating, b_u is a user bias (some users rate everything high), and b_i is an item bias (some items are simply more popular). Koren, Bell, and Volinsky popularized this baseline-plus-bias formulation in their 2009 IEEE Computer survey, and it remains the default formulation in libraries such as Surprise.
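To make the bookkeeping concrete, here is a minimal NumPy sketch of the biased prediction rule. The matrix shapes, the global mean, and the example indices are illustrative placeholders, not values from any particular dataset.

```python
import numpy as np

n_users, n_items, k = 1000, 500, 32            # illustrative sizes

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(n_users, k))   # user matrix: one k-dim row per user
V = rng.normal(scale=0.1, size=(n_items, k))   # item matrix: one k-dim row per item
b_u = np.zeros(n_users)                        # user biases
b_i = np.zeros(n_items)                        # item biases
mu = 3.6                                       # global mean rating (illustrative)

def predict(u, i):
    """Biased matrix factorization prediction: mu + b_u + b_i + U_u . V_i."""
    return mu + b_u[u] + b_i[i] + U[u] @ V[i]

print(predict(42, 7))   # predicted rating for user 42 on item 7
```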
The latent factors learned in U have no fixed meaning. After training they may loosely correspond to genres, demographics, or quality, but the model is free to use the dimensions however it wants. This is the entire point of model-based collaborative filtering: instead of asking a human to define features, the optimizer discovers a user representation that explains the observed ratings.
Classical Singular Value Decomposition factors a complete matrix into U Σ V^T. Recommender data is almost never complete (a typical user rates a few hundred out of millions of possible items), so true SVD does not apply directly. The breakthrough was to treat the missing entries as missing rather than as zero, and to learn U and V by minimizing squared error only on the observed ratings.
Simon Funk popularized this approach in his December 11, 2006 blog post "Try This At Home," written during the Netflix Prize competition. Funk trained one latent factor at a time using stochastic gradient descent on the residual ratings and shared C code that fit on a few pages. His model, often called "Funk SVD" or simply MF, is not the strict mathematical SVD but inherited the name. Funk's submission climbed to third place on the Netflix leaderboard and inspired many of the techniques that the Netflix Prize winners eventually combined.
Yehuda Koren, Robert Bell, and Chris Volinsky generalized the idea, added bias terms, integrated implicit feedback, and described the full framework in the 2009 article "Matrix Factorization Techniques for Recommender Systems," published in IEEE Computer (volume 42, issue 8, pages 30 to 37). That paper is the standard reference for the user-matrix / item-matrix formulation and became one of the most cited recommender systems publications of the decade.
The Netflix Prize itself ran from October 2006 to September 2009 with a $1,000,000 grand prize for any team that could beat Cinematch's RMSE on a held-out test set by 10 percent. The team "BellKor's Pragmatic Chaos" (which included Bell, Koren, and Volinsky, plus researchers from Pragmatic Theory and Commendo Research) won the prize on September 21, 2009, edging out the tied team "The Ensemble" by submitting its final entry 20 minutes earlier. Their winning model was a blend of dozens of components, but the matrix factorization core (with user and item matrices) was the single most important ingredient.
The user and item matrices are usually trained by minimizing regularized squared error on the observed ratings:
L = Σ_{(u,i)∈observed} (r_{u,i} − μ − b_u − b_i − U_u · V_i)² + λ (‖U_u‖² + ‖V_i‖² + b_u² + b_i²)
The λ term is L2 regularization that keeps factor magnitudes from blowing up on users or items with very few ratings. Three optimization methods dominate.
| method | how it works | strengths | weaknesses |
|---|---|---|---|
| stochastic gradient descent (SGD) | iterate over ratings; for each (u, i, r) take a gradient step on U_u, V_i, b_u, b_i | very fast per step, easy to implement, what Funk used in 2006 | hyperparameter-sensitive; harder to parallelize across many machines |
| alternating least squares (ALS) | hold V fixed, solve a least-squares problem for each row of U; then swap and solve for V | each subproblem is closed-form and embarrassingly parallel; works well on Spark | per-iteration cost is higher; less natural for online updates |
| coordinate descent | update one coordinate of one row at a time | very fast convergence on sparse problems; used in LibFM and LibMF | trickier to implement; not built into mainstream frameworks |
In practice, SGD dominates research code (the Surprise library defaults to SGD), while ALS dominates large-scale industry deployments because the per-iteration step parallelizes cleanly. Hu, Koren, and Volinsky's 2008 ICDM paper introduced an ALS variant for implicit feedback that scales linearly with the number of observed interactions, which is what made it possible to train the user matrix on billions of clicks at companies like Yahoo and AT&T.
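As a rough illustration of the SGD recipe in the table above, here is a minimal NumPy training loop over observed (user, item, rating) triples that minimizes the regularized squared-error loss given earlier. The toy data and hyperparameters are placeholders, not tuned values.

```python
import numpy as np

# toy observed ratings: (user index, item index, rating) -- placeholder data
ratings = [(0, 0, 5.0), (0, 2, 3.0), (1, 1, 4.0), (2, 0, 1.0), (2, 2, 2.0)]

n_users, n_items, k = 3, 3, 8
lr, lam, n_epochs = 0.01, 0.05, 50          # learning rate, L2 strength, passes over the data

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(n_users, k))
V = rng.normal(scale=0.1, size=(n_items, k))
b_u = np.zeros(n_users)
b_i = np.zeros(n_items)
mu = np.mean([r for _, _, r in ratings])

for _ in range(n_epochs):
    for u, i, r in ratings:
        err = r - (mu + b_u[u] + b_i[i] + U[u] @ V[i])   # residual on this rating
        # gradient steps on exactly the parameters this rating touches
        b_u[u] += lr * (err - lam * b_u[u])
        b_i[i] += lr * (err - lam * b_i[i])
        U_u_old = U[u].copy()
        U[u] += lr * (err * V[i] - lam * U[u])
        V[i] += lr * (err * U_u_old - lam * V[i])
```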
The basic user matrix has been extended many times. The most influential variants are listed below.
| variant | year and reference | what it adds to U |
|---|---|---|
| Funk SVD | Funk, 2006 ("Try This At Home" blog) | the original matrix factorization trained with SGD; one latent factor at a time |
| regularized SVD with bias | Paterek, 2007 KDD Cup | global mean plus user and item bias terms added to U_u · V_i |
| SVD++ | Koren, 2008 KDD ("Factorization Meets the Neighborhood") | adds an implicit-feedback term to the user vector based on the set of items the user has interacted with, even without ratings |
| timeSVD++ | Koren, 2009 KDD; CACM 2010 | makes user biases and user latent factors functions of time, capturing drift in taste |
| probabilistic matrix factorization (PMF) | Salakhutdinov and Mnih, NIPS 2007 | gives U and V Gaussian priors and trains by maximum a posteriori; scales linearly and handles sparse Netflix data well |
| Bayesian PMF | Salakhutdinov and Mnih, ICML 2008 | full Bayesian treatment with MCMC sampling over U and V |
| non-negative matrix factorization (NMF) | Lee and Seung, Nature 1999 | constrains both U and V to be non-negative, often producing more interpretable factors |
| implicit ALS | Hu, Koren, Volinsky, ICDM 2008 | redefines the loss as confidence-weighted preference for implicit feedback (clicks, plays, views) instead of explicit ratings |
| factorization machines | Rendle, ICDM 2010 | generalizes the user matrix to handle arbitrary side features (user demographics, device, time of day) within the same factorized framework |
SVD++ is worth singling out because it explicitly modifies the user vector. The user representation is no longer a single learned row of U; it becomes
user vector = U_u + |N(u)|^(−1/2) Σ_{j ∈ N(u)} y_j
where N(u) is the set of items user u has implicitly interacted with and y_j is a per-item implicit-feedback factor. This means the model still has a base user matrix U, but it augments each user's embedding with information from items they have viewed even when they did not give an explicit rating. SVD++ was a major contributor to the BellKor Netflix Prize submission.
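A small sketch of how that augmented user vector could be assembled, assuming the base matrix U, the per-item implicit factors y, and the interaction sets N(u) have already been learned; the arrays below are illustrative placeholders.

```python
import numpy as np

k = 16
n_users, n_items = 100, 400
rng = np.random.default_rng(0)

U = rng.normal(scale=0.1, size=(n_users, k))   # base user matrix
y = rng.normal(scale=0.1, size=(n_items, k))   # per-item implicit-feedback factors
N = {0: [3, 17, 42]}                           # items user 0 has interacted with (placeholder)

def svdpp_user_vector(u):
    """SVD++ user representation: U_u + |N(u)|^(-1/2) * sum of y_j over interacted items j."""
    items = N.get(u, [])
    if not items:
        return U[u]
    return U[u] + y[items].sum(axis=0) / np.sqrt(len(items))

print(svdpp_user_vector(0)[:4])
```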
A pure user matrix has one row per user, learned only from that user's observed ratings. This causes two well-known problems:

- Cold start: a brand-new user has no observed ratings, so their row of U is undefined (or stuck at its random initialization) and the model has nothing useful to rank with.
- No generalization across users: because the row is a pure ID lookup, the model cannot produce an embedding for a user it never saw during training, and adding users normally means retraining or folding them in separately.
The standard mitigations are to seed new users with explicit onboarding questions, to use content-based features (age, country, signup source) to initialize U_u, or to switch to a model like factorization machines that can blend latent factors with side features in the same equation. Two-tower neural models, discussed below, address the cold-start problem more cleanly because the user tower is a function of features instead of a lookup row.
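One hedged illustration of the content-based initialization idea: learn a simple linear map from user side features to the latent space on existing users, then use it to seed U_u for a new user. The feature choice and the least-squares fit here are assumptions made for the sketch, not a prescribed recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, k, n_features = 500, 32, 6

U = rng.normal(scale=0.1, size=(n_users, k))   # trained user matrix
X = rng.normal(size=(n_users, n_features))     # user side features (age bucket, country, signup source, ...)

# fit a linear map W so that X @ W approximates the trained user vectors
W, *_ = np.linalg.lstsq(X, U, rcond=None)

def init_new_user(features):
    """Seed a cold-start user's latent vector from their side features."""
    return features @ W

new_user_features = rng.normal(size=n_features)   # placeholder features for a new signup
U_new = init_new_user(new_user_features)          # usable immediately in mu + b + U_new . V_i
```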
Classical matrix factorization assumes explicit ratings such as 1-to-5 stars. Most production systems do not have explicit ratings; they have implicit feedback signals such as clicks, plays, dwell time, purchases, or completions. Implicit feedback has different statistical properties:

- there is no explicit negative signal (a user who never clicked an item may simply never have seen it),
- the observations are noisy (a play or a purchase does not guarantee the user liked the item), and
- the numerical value (play count, dwell time) reflects confidence in the signal rather than the strength of a preference.
Hu, Koren, and Volinsky's 2008 paper redefined the objective as a confidence-weighted least squares problem over a binary preference matrix and gave an ALS algorithm whose cost scales linearly with the number of nonzero entries. This is the algorithm implemented in Spark's ALS.trainImplicit and in the Python implicit package, and it is one of the most widely deployed user-matrix training procedures in industry.
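A minimal usage sketch with the implicit package; version-dependent details, such as the expected orientation of the interaction matrix and the return type of recommend, are assumptions to check against the installed release.

```python
import numpy as np
import scipy.sparse as sp
import implicit

# toy user-item interaction counts (e.g. play counts); rows = users, columns = items
interactions = sp.csr_matrix(np.array([
    [3, 0, 0, 1],
    [0, 5, 0, 0],
    [1, 0, 2, 0],
], dtype=np.float32))

# confidence-weighted implicit ALS (Hu, Koren, Volinsky 2008)
model = implicit.als.AlternatingLeastSquares(factors=16, regularization=0.01, iterations=15)
model.fit(interactions)            # recent releases expect a user-by-item matrix

U = model.user_factors             # the learned user matrix
V = model.item_factors             # the learned item matrix
ids, scores = model.recommend(0, interactions[0], N=2)   # top-2 items for user 0
```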
Matrix factorization with a learned user matrix has powered or contributed to recommendation systems at Netflix, Amazon, Yahoo, AT&T, LinkedIn, Spotify, and Last.fm. Even when the production model is now a deep neural network, the user matrix usually survives in some form as the user-side embedding table, because the ID-to-vector lookup is the smallest, fastest, and most reliable piece of any recommender.
Netflix dropped the actual Cinematch system in 2012 in favor of personalized ranking models, but Netflix engineers have stated publicly that the matrix factorization techniques developed during the Prize remained part of the production stack for years afterward.
The user matrix is rarely evaluated in isolation. The metrics measure the quality of the predictions or rankings produced by combining U and V.
| metric | what it measures | typical use |
|---|---|---|
| RMSE (root mean squared error) | square root of the mean squared error between predicted and held-out ratings | explicit feedback, Netflix Prize style benchmarks |
| MAE (mean absolute error) | average absolute error | explicit feedback, less sensitive to outliers than RMSE |
| HR@k (hit rate at k) | fraction of users for whom the held-out item appears in the top k | implicit feedback ranking |
| NDCG@k (normalized discounted cumulative gain) | top-k ranking quality with positional discount | implicit feedback ranking, weighted by relevance |
| MAP (mean average precision) | average precision at the ranks of relevant items, averaged over users | implicit feedback ranking |
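For concreteness, a small sketch of two of these metrics on held-out data; the arrays are placeholders and the held-out protocol (one item per user for HR@k) is just one common choice.

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean squared error over held-out explicit ratings."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

def hit_rate_at_k(ranked_items_per_user, held_out_item_per_user, k=10):
    """Fraction of users whose single held-out item appears in their top-k list."""
    hits = sum(
        held_out in ranked[:k]
        for ranked, held_out in zip(ranked_items_per_user, held_out_item_per_user)
    )
    return hits / len(held_out_item_per_user)

print(rmse([3.5, 4.0, 2.0], [4.0, 4.0, 1.0]))
print(hit_rate_at_k([[7, 2, 9], [1, 5, 3]], [2, 4], k=3))   # user 0 hits, user 1 misses -> 0.5
```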
| library | language | notes |
|---|---|---|
| Surprise | Python | textbook implementations of SVD, SVD++, NMF, KNN baselines; SGD-based |
| Spark MLlib ALS | Scala / Python / Java | distributed ALS for both explicit and implicit feedback (implicitPrefs=True) |
| implicit (Ben Frederickson) | Python with C extensions | fast ALS, BPR, and logistic MF for implicit feedback datasets |
| LibFM | C++ | factorization machines by Steffen Rendle; supports SGD, ALS, MCMC |
| LibMF | C++ | high-performance parallel SGD for matrix factorization |
| TensorFlow Recommenders | Python | modern factorization models, retrieval, and two-tower architectures |
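As one concrete entry point, a short Surprise sketch that trains a biased SVD model and reads off the learned user matrix; the dataset and hyperparameters are illustrative (the fitted model exposes the user factors as its pu attribute and the item factors as qi).

```python
from surprise import SVD, Dataset

data = Dataset.load_builtin("ml-100k")     # downloads MovieLens 100k on first use
trainset = data.build_full_trainset()

algo = SVD(n_factors=50, n_epochs=20, lr_all=0.005, reg_all=0.02)
algo.fit(trainset)

U = algo.pu        # user matrix: one row per inner user id
V = algo.qi        # item matrix
print(U.shape)     # (number of users, 50)

# predict a single rating; ids are the dataset's raw ids (strings in ml-100k)
print(algo.predict(uid="196", iid="302").est)
```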
Deep learning displaced classical matrix factorization in many production systems during the late 2010s, but the user matrix concept did not disappear. It evolved into the user tower of two-tower retrieval models.
| approach | how the user representation is built | where it is used |
|---|---|---|
| classical MF | one trained row of U per user (a lookup) | Netflix Prize, Surprise, Spark ALS, most baselines |
| Neural Collaborative Filtering (NCF) | user embedding fed through an MLP that also takes the item embedding | He et al., WWW 2017 |
| Wide and Deep | embedding for low-dimensional latent factors plus wide cross features | Cheng et al., Google, RecSys 2016 |
| DLRM | sparse user embeddings combined with dense features and feature crosses | Naumov et al., Facebook, 2019 |
| two-tower model | user tower is a neural network mapping user features and history to an embedding; item tower does the same | YouTube, TikTok, Spotify, Pinterest, Google retrieval |
| Graph neural recommenders (PinSage, LightGCN) | user node embeddings are aggregated from the user-item interaction graph | Pinterest, Alibaba |
| LLM-based recommenders | user representation is a prompt or learned soft prompt; item retrieval is by language modeling | research and early production at TikTok, Meta, Amazon |
The two-tower architecture is the most direct descendant of classical matrix factorization. The user tower replaces the static row of U with a function of user features (history, demographics, recent queries) so that new users can be represented immediately. The dot product between the user-tower output and the item-tower output is the same scoring function as U_u · V_i, and the resulting embeddings are still indexed by approximate nearest neighbor search. In that sense, the user matrix has not been replaced; it has been generalized.
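To make that correspondence explicit, here is a hedged PyTorch-style sketch of a tiny two-tower scorer. The feature choices, layer sizes, and the use of a mean-pooled history are invented for illustration; production towers are far richer and the training loop is omitted.

```python
import torch
import torch.nn as nn

class UserTower(nn.Module):
    """Maps user features (an id embedding plus a pooled history) to a k-dim embedding."""
    def __init__(self, n_users, n_items, k=32):
        super().__init__()
        self.user_id_emb = nn.Embedding(n_users, k)
        self.item_emb = nn.Embedding(n_items, k)      # embeds the user's click/watch history
        self.mlp = nn.Sequential(nn.Linear(2 * k, 64), nn.ReLU(), nn.Linear(64, k))

    def forward(self, user_ids, history_item_ids):
        history = self.item_emb(history_item_ids).mean(dim=1)   # average of history item embeddings
        x = torch.cat([self.user_id_emb(user_ids), history], dim=-1)
        return self.mlp(x)

class ItemTower(nn.Module):
    def __init__(self, n_items, k=32):
        super().__init__()
        self.item_id_emb = nn.Embedding(n_items, k)
        self.mlp = nn.Sequential(nn.Linear(k, 64), nn.ReLU(), nn.Linear(64, k))

    def forward(self, item_ids):
        return self.mlp(self.item_id_emb(item_ids))

# scoring is still a dot product, exactly like U_u . V_i in classical MF
user_tower, item_tower = UserTower(1000, 500), ItemTower(500)
u = user_tower(torch.tensor([3]), torch.tensor([[7, 42, 10]]))
v = item_tower(torch.tensor([99]))
score = (u * v).sum(dim=-1)
```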
The user matrix in its classical form has well-known weaknesses:

- it cannot represent a user who was not in the training data (the cold-start problem described above),
- it has no way to use side features such as demographics, device, or time of day without moving to an extension like factorization machines,
- the dot-product scoring captures only linear interactions between user and item factors, and
- drift in taste and session context are invisible unless the model is extended along the lines of timeSVD++ or sequence-aware neural models.
Despite these limitations, the user matrix remains the conceptual baseline for almost every academic recommender system paper, the de facto starting point for industry teams building a new recommender, and a building block inside larger neural systems. When practitioners say "learn user embeddings," they almost always mean some descendant of U.
Imagine a giant grid where every row is a person at a candy party and every column is a different type of candy. Most squares are empty because most people have not tried most candies. The user matrix is a much smaller grid that gives each person a short list of secret numbers describing what kinds of candy they probably like (sweet, sour, chocolatey, chewy, and so on). The candy gets its own short list of secret numbers too. To guess whether someone will like a candy, you multiply their numbers with the candy's numbers and add them up. The bigger the answer, the better the match. The user matrix is the part of the system that holds everyone's secret-numbers card.