See also: Machine learning terms
In collaborative filtering and matrix factorization recommender systems, the item matrix (commonly written V, sometimes Q or H) is the matrix of latent-factor vectors for items. Each row is one item's embedding in a low-dimensional latent space, and each column is one of the latent factors learned during training. Together with the user matrix U (sometimes written P), the item matrix is used to approximate the very large, very sparse user-item interaction matrix R that records which users have interacted with which movies, products, songs, articles, or videos.
The item matrix is the item-side half of the basic matrix factorization model that powered most production recommendation system work from the mid-2000s onward. In modern neural retrieval architectures, the same lookup table survives as the item-tower output or the item-embedding bank that approximate nearest neighbor (ANN) indexes serve at inference time.
Let there be N users and M items, and let R be the N by M interaction matrix whose entry r_{u,i} is user u's rating, click, view, or purchase of item i (or missing if no interaction is recorded). Matrix factorization picks a small latent dimension k (typically 16 to 256) and learns two matrices: a user matrix U of shape N by k, with one row U_u per user, and an item matrix V of shape M by k, with one row V_i per item.
The interaction matrix is approximated by the product
R ≈ U V^T
so the predicted score for user u on item i is the dot product
r̂_{u,i} = U_u · V_i = Σ_{f=1..k} U_{u,f} V_{i,f}.
A more accurate version adds bias terms:
r̂_{u,i} = μ + b_u + b_i + U_u · V_i
where μ is the global mean rating, b_u is a user bias, and b_i is the item bias. The item bias is one of the most useful pieces of the model in practice; it captures an item's inherent popularity or quality so that V_i does not have to absorb "everyone tends to rate this thing high." Koren, Bell, and Volinsky introduced this baseline-plus-bias formulation in their 2009 IEEE Computer survey, and it remains the default in libraries such as Surprise and Spark MLlib.
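As a concrete illustration, here is a minimal NumPy sketch of this prediction rule. The sizes and randomly initialized factors are purely hypothetical toy values:

```python
import numpy as np

# Hypothetical toy sizes: 4 users, 6 items, k = 3 latent factors.
N, M, k = 4, 6, 3
rng = np.random.default_rng(0)

U = rng.normal(scale=0.1, size=(N, k))   # user matrix, one row per user
V = rng.normal(scale=0.1, size=(M, k))   # item matrix, one row per item
b_u = np.zeros(N)                        # user biases
b_i = np.zeros(M)                        # item biases
mu = 3.5                                 # global mean rating

def predict(u, i):
    """Predicted score: global mean + user bias + item bias + dot product."""
    return mu + b_u[u] + b_i[i] + U[u] @ V[i]

print(predict(0, 2))
```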
The individual columns of V, the latent dimensions, have no fixed meaning. After training they may loosely correspond to genres, price tiers, audience demographics, or production quality, but the optimizer is free to use the dimensions however it wants. This is the central trade-off of model-based collaborative filtering: instead of relying on hand-engineered item features, V is whatever low-rank summary of the interaction data minimizes prediction error.
The item matrix is sometimes confused with the broader item feature matrix used in content-based filtering. They are different objects.
| matrix | shape | what each row represents | how columns are defined |
|---|---|---|---|
| item matrix V (matrix factorization) | M by k | latent factor vector for one item | learned during training; columns have no fixed meaning |
| item feature matrix (content-based) | M by d | hand-engineered or pretrained features for one item | columns are explicit attributes (genre, price, TF-IDF tokens, image embedding dimensions) |
| rating / interaction matrix R | N by M | one user's interactions across all items | columns are items (one column per item in the catalog) |
| item-item similarity matrix | M by M | similarities between item i and every other item | columns are items; cells are cosine or Pearson similarities |
The content-based item feature matrix and the matrix-factorization item matrix can be combined. Hybrid models concatenate them, or train a neural item tower whose inputs are the content features and whose output replaces V_i.
There is no single algorithm for learning the item matrix. Different families of methods produce item matrices with very different properties.
| method | year and reference | how V is produced |
|---|---|---|
| Funk SVD | Funk, 2006 ("Try This At Home" blog) | jointly learn U and V by stochastic gradient descent on observed ratings, one latent factor at a time |
| regularized SVD with bias | Paterek, 2007 KDD Cup | add global mean and item bias b_i; V is regularized with L2 |
| ALS for matrix factorization | Zhou et al., 2008; Hu, Koren, Volinsky, ICDM 2008 | hold V fixed and solve for U by least squares; swap and solve for V; closed-form per row, parallelizable |
| SVD++ | Koren, KDD 2008 | adds a per-item implicit factor y_j; the user vector becomes U_u plus a sum of y_j over interacted items |
| probabilistic matrix factorization | Salakhutdinov and Mnih, NIPS 2007 | MAP estimation with Gaussian priors on U and V |
| non-negative matrix factorization (NMF) | Lee and Seung, Nature 1999 | both U and V are constrained non-negative, often producing more interpretable item factors |
| Item2Vec | Barkan and Koenigstein, IEEE MLSP 2016 | apply word2vec skip-gram with negative sampling to sequences of items co-purchased or co-consumed by the same user |
| Prod2Vec | Grbovic et al., KDD 2015 | word2vec on sequences of products from email purchase receipts at Yahoo |
| Meta-Prod2Vec | Vasile et al., RecSys 2016 | Prod2Vec regularized with item metadata, addressing cold start |
| Neural Collaborative Filtering | He et al., WWW 2017 | item embedding fed through an MLP that also takes the user embedding |
| two-tower model | Yi et al., RecSys 2019 (YouTube) | item tower is a neural net mapping item content features to V_i; user tower mirrors it |
| PinSage | Ying et al., KDD 2018 (Pinterest) | graph convolutional network over the user-pin-board graph; V_i aggregates neighbor information |
Funk SVD, ALS, and SVD++ produce V as a free lookup table: every item in the catalog gets a row whose values are learned independently. Item2Vec, Prod2Vec, and the two-tower variants instead make V a function of something else (co-occurrence statistics or item content), which is usually what makes them workable for catalogs with constant churn.
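For concreteness, here is a minimal NumPy sketch of the ALS recipe from the table: explicit ratings, no bias terms, and a dense observation mask for clarity. Production implementations such as Spark's operate on sparse data and parallelize the per-row solves.

```python
import numpy as np

def als(R, mask, k=16, lam=0.1, iters=10, seed=0):
    """Minimal alternating least squares for explicit feedback.
    R: (N, M) ratings; mask: (N, M) array, 1 where observed, 0 where missing."""
    rng = np.random.default_rng(seed)
    N, M = R.shape
    U = rng.normal(scale=0.1, size=(N, k))
    V = rng.normal(scale=0.1, size=(M, k))
    reg = lam * np.eye(k)
    for _ in range(iters):
        # Hold V fixed; solve a ridge regression for each user row.
        for u in range(N):
            obs = mask[u] > 0
            Vu = V[obs]                            # items this user rated
            U[u] = np.linalg.solve(Vu.T @ Vu + reg, Vu.T @ R[u, obs])
        # Swap: hold U fixed; solve for each item row.
        for i in range(M):
            obs = mask[:, i] > 0
            Ui = U[obs]                            # users who rated this item
            V[i] = np.linalg.solve(Ui.T @ Ui + reg, Ui.T @ R[obs, i])
    return U, V
```

Note that an item with no observed interactions gets a zero row, which is exactly the cold-start shrinkage discussed later.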
Item2Vec is the most influential reframing of the item matrix. Barkan and Koenigstein's 2016 paper observed that the user-item interaction log has the same shape as a text corpus: each user's session or basket is a "sentence," and each item is a "word." Running word2vec skip-gram with negative sampling (SGNS) on those sequences yields an item matrix whose rows behave like word vectors. Cosine similarity between two rows of V approximates how often the items appear together in user histories, and the geometry supports analogy-style queries.
Several attractive properties carry over from word embeddings:
- nearest neighbors in cosine space correspond to items that co-occur in user histories, giving an immediate "related items" lookup;
- skip-gram with negative sampling trains in time roughly linear in the number of interactions, so very large catalogs are feasible;
- no user matrix has to be estimated or stored, which simplifies both training and serving.
Prod2Vec, published a year earlier by Grbovic et al. at Yahoo, applied the same idea to email purchase receipts. The Yahoo Mail team treated each user's receipts as a basket of products and trained product vectors with skip-gram. A follow-up by Vasile et al., Meta-Prod2Vec, regularized the embeddings with product metadata so that new SKUs with no purchase history could still be placed in the embedding space.
The practical lesson, repeated by Airbnb's listing embeddings (Grbovic and Cheng, KDD 2018), Spotify's track embeddings, and Alibaba's product embeddings, is that the item matrix does not have to be fit jointly with a user matrix. Treating sessions as sentences and running a skip-gram-style objective is often enough.
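A minimal sessions-as-sentences sketch with Gensim's Word2Vec; the session data and item IDs below are hypothetical:

```python
from gensim.models import Word2Vec

# Each user session or basket is a "sentence"; each item ID is a "word".
sessions = [
    ["item_42", "item_7", "item_99"],
    ["item_7", "item_13"],
    ["item_42", "item_99", "item_13", "item_7"],
]

# Skip-gram with negative sampling (sg=1, negative=5), as in Item2Vec.
# Item2Vec as published treats the whole basket as one context window;
# a finite window as used here is a common approximation.
model = Word2Vec(
    sentences=sessions,
    vector_size=64,   # latent dimension k
    window=5,
    min_count=1,
    sg=1,
    negative=5,
    epochs=20,
)

# The item matrix: one row per item ID.
v_item = model.wv["item_42"]
print(model.wv.most_similar("item_42", topn=3))
```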
The item bias term b_i in the rating equation deserves its own attention because it absorbs an enormous amount of the variance in real data. On the Netflix Prize data, a baseline that predicts only the global mean plus user bias plus item bias already captures a large share of Cinematch's accuracy, before any U or V is trained at all. The reason is intuitive: a Pixar movie is rated higher than a direct-to-DVD release by almost every user, and that effect has nothing to do with personalization.
Separating popularity into b_i has two practical benefits. First, the latent factors V_i are freed up to encode taste differences instead of being burned on "this movie is good in general." Second, the bias terms can be served and updated independently of the heavier latent matrices, which is convenient in production systems that update popularity counts every few minutes but only retrain V nightly or weekly.
A closely related concern is popularity bias in recommendations. Because head items appear far more often in training data than long-tail items, both classical matrix factorization and neural retrieval tend to over-represent the head at inference time. A 2024 survey by Klimashevskaia et al. in User Modeling and User-Adapted Interaction documents the size of the effect across MovieLens, Yelp, and Amazon datasets. Mitigations include:
- reweighting or downsampling head items during training, for example with inverse-propensity-style weights;
- regularization terms that penalize the correlation between predicted scores and item popularity;
- post-processing re-ranking that trades a small amount of accuracy for more long-tail exposure.
Exposure fairness is a related but distinct concern, especially on platforms where item providers depend on impressions for revenue. Researchers have proposed (α, β)-fairness constraints that require similar items to receive similar coverage, embedded into the matrix factorization objective.
Once V is trained, the item matrix is most often used not for predicting ratings but for finding similar items. Two items i and j are considered similar if their rows have a high cosine similarity:
sim(i, j) = (V_i · V_j) / (‖V_i‖ ‖V_j‖)
This is the engine behind "customers who bought this also bought" carousels, "more like this" panels, related-video sidebars, and content-discovery feeds. Item-item retrieval was popularized by Amazon's 2003 paper.
Greg Linden, Brent Smith, and Jeremy York's Amazon.com Recommendations: Item-to-Item Collaborative Filtering in IEEE Internet Computing (vol. 7, no. 1, 2003, pp. 76 to 80) showed that comparing items rather than users scales much better. With M items the offline cost is O(M^2) at worst (and far less in practice because most pairs share no common buyers), while the online cost per recommendation scales only with the number of items the customer has bought or rated, independent of catalog size. The same paper showed that item neighborhoods can be precomputed daily for the entire Amazon catalog. In 2017, IEEE Internet Computing's editorial board picked this paper as the single most influential article from the journal's first 20 years.
The item matrix V is a learned drop-in replacement for the explicit item-similarity matrix in Amazon-style item-to-item CF. Instead of computing cosine similarity over very long sparse interaction columns, the system computes it over short dense V_i rows. This is far cheaper at query time and supports approximate nearest neighbor search.
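A sketch of that similar-item lookup over a trained V: row-normalize once, then a single matrix-vector product yields cosine similarities against the whole catalog.

```python
import numpy as np

def top_similar(V, item_idx, n=10):
    """Return indices of the n items whose rows are most cosine-similar to item_idx."""
    # Normalize rows so dot products become cosine similarities.
    V_norm = V / np.linalg.norm(V, axis=1, keepdims=True)
    sims = V_norm @ V_norm[item_idx]
    sims[item_idx] = -np.inf          # exclude the item itself
    return np.argsort(-sims)[:n]
```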
Production systems with millions of items cannot exhaustively compute U_u · V_i for the whole catalog at every request. The item matrix is therefore loaded into a vector index that supports approximate nearest neighbor search.
| index / library | developed by | indexing strategy | typical use |
|---|---|---|---|
| FAISS | Meta (Facebook AI Research), open source 2017 | IVF, HNSW, product quantization, GPU support | item retrieval at Meta, many open-source recommenders |
| ScaNN | Google, 2020 | anisotropic vector quantization plus tree search | YouTube two-tower retrieval, Google Cloud Vertex AI |
| HNSWlib | Malkov and Yashunin, 2016 | hierarchical navigable small world graphs | Spotify, Pinecone, Weaviate, Qdrant |
| Annoy | Spotify, 2013 | random-projection forests | Spotify radio, smaller-scale services |
| Pinecone, Weaviate, Milvus, Qdrant | various | managed vector databases wrapping HNSW or IVF | hosted item retrieval and semantic search |
A modern recommender typically stores the item matrix as flat vectors in one of these indexes, recomputes it nightly or hourly during retraining, and serves nearest-neighbor queries with sub-100ms latency over indexes with hundreds of millions of items. The two-tower YouTube system described by Yi et al. (2019) is an explicit example: the item tower output is dumped into a ScaNN index that handles tens of millions of videos, and the served user tower vector is matched against that index in real time.
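A minimal FAISS sketch of this serving pattern, using an exact inner-product index for brevity; production deployments would typically use IVF or HNSW variants with quantization:

```python
import faiss
import numpy as np

k = 64                                                  # latent dimension
V = np.random.rand(1_000_000, k).astype("float32")     # item matrix (placeholder values)

index = faiss.IndexFlatIP(k)     # exact maximum-inner-product search
index.add(V)                     # load the item matrix into the index

user_vec = np.random.rand(1, k).astype("float32")      # served user vector
scores, item_ids = index.search(user_vec, 100)          # top-100 candidate items
```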
Every row of V corresponds to an item with at least some interaction history. A brand-new item has no history, so there is nothing to fit V_i to and the row would be set to the regularizer's prior (close to zero). This is the item-side cold start problem, and it is generally more painful than its user-side counterpart because most platforms add new items continuously while users churn more slowly.
The usual mitigations:
- derive an initial V_i from content features (text, image, or metadata embeddings), as the item tower in a two-tower model does;
- regularize the embedding toward metadata-based neighbors, in the style of Meta-Prod2Vec or LightFM;
- fall back to bias-only or popularity-based scoring until enough interactions accumulate;
- reserve a small amount of exploration traffic so new items collect interactions quickly.
In practice most large platforms run a hybrid: a content-only embedding handles the first hours or days of an item's life, then the system blends in collaborative signal as views and clicks accumulate, and eventually the collaborative signal dominates.
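One way to express that blend is an interaction-count weight. The schedule below, including the crossover constant n0, is a hypothetical illustration rather than a published recipe:

```python
import numpy as np

def blended_item_vector(v_content, v_cf, n_interactions, n0=100):
    """Blend a content-based embedding with a collaborative one.
    The collaborative vector dominates once the item has roughly n0 interactions."""
    alpha = n_interactions / (n_interactions + n0)
    return (1 - alpha) * np.asarray(v_content) + alpha * np.asarray(v_cf)
```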
It is worth being explicit about how the matrix-factorization item matrix differs from the item representation in pure content-based filtering.
| dimension | matrix factorization V | content-based filtering item vector |
|---|---|---|
| source of features | learned from user-item interactions | derived from item attributes (text, image, metadata) |
| handles new items? | no, cold start | yes, immediately |
| handles new users? | no without bootstrap | yes if user profile features exist |
| captures taste signals invisible in content? | yes (e.g., a movie loved by a particular cluster of viewers) | no, limited to content similarity |
| catalog growth cost | retraining required as items are added | embedding can be computed for a single new item in isolation |
| typical use today | ID-based retrieval, similar-item carousels | item tower in two-tower models, cold-start fallback |
Hybrid recommender architectures combine both. The item tower in a modern two-tower model takes content features as input but is trained on user-item interaction data, so V_i ends up encoding both content similarity and collaborative signal in the same vector.
The item matrix shows up by name or by direct analogy in almost every published large-scale recommender architecture.
| system | reference | role of the item matrix |
|---|---|---|
| Amazon item-to-item CF | Linden, Smith, York, IEEE Internet Computing 2003 | precomputed item-item similarity table; later replaced by learned V in Amazon's neural retrieval |
| Netflix Prize matrix factorization | Koren, Bell, Volinsky, IEEE Computer 2009 | learned V with bias term b_i; central to BellKor's Pragmatic Chaos winning blend |
| YouTube two-tower retrieval | Yi et al., RecSys 2019 | item tower output is V_i; loaded into a ScaNN index for serving tens of millions of videos on YouTube |
| Pinterest PinSage | Ying et al., KDD 2018 | V_i aggregated from a graph convolutional network over 3 billion pins and boards |
| Airbnb listing embeddings | Grbovic and Cheng, KDD 2018 | listing-as-word skip-gram, including dwell-time and conversion as positive signals; powers "similar listings" |
| Spotify track embeddings | Spotify engineering blog, multiple posts | item matrix learned from user listening sessions; used for radio and Discover Weekly |
| Alibaba EGES | Wang et al., KDD 2018 | enhanced graph embedding with side information for Taobao items |
| Meta DLRM | Naumov et al., 2019 | item embedding tables combined with dense features and feature crosses |
| Twitter / X SimClusters | Twitter engineering 2020 | sparse item embeddings over learned community memberships |
| TikTok recommendation | various engineering posts | two-tower retrieval over short-video item embeddings, refreshed continuously |
Many of these systems do not call V the "item matrix" anymore. They call it the item embedding table, the item tower, or simply "the index." Architecturally it is the same thing.
The latent dimension k controls capacity. Common ranges:
| dataset / system | typical k | notes |
|---|---|---|
| MovieLens 100k / 1M (research benchmarks) | 10 to 50 | enough for the small catalog; larger k overfits |
| Netflix Prize | 50 to 200 | the BellKor team used several k values in their blend |
| medium e-commerce (~10^5 SKUs) | 32 to 128 | balances offline RMSE and serving cost |
| YouTube and TikTok scale (10^8+ items) | 64 to 256 | bottlenecked by ANN index size and serving latency |
| word2vec-style item embeddings | 100 to 300 | inherits the word2vec convention |
Larger k captures more nuance per item but increases memory, slows ANN search, and demands more regularization to avoid overfitting on items with little data. The L2 regularizer λ ‖V_i‖^2 in the matrix factorization loss shrinks rows with few interactions toward zero, which is one of the reasons cold-start items end up indistinguishable until enough signal accumulates.
| library | language | item-matrix support |
|---|---|---|
| Surprise | Python | SVD, SVD++, NMF, KNN baselines; V exposed as qi after training |
| Spark MLlib ALS | Scala / Python / Java | distributed ALS for explicit and implicit feedback; item factors fetched via model.itemFactors |
| implicit (Ben Frederickson) | Python with C extensions | fast ALS, BPR, and logistic MF; model.item_factors is V |
| LightFM | Python | hybrid model that combines latent and content features in V; supports cold start |
| LibFM | C++ | factorization machines (Steffen Rendle); generalizes V to arbitrary feature interactions |
| TensorFlow Recommenders (TFRS) | Python | modern factorization, retrieval, and two-tower architectures with explicit item tower API |
| Gensim Word2Vec | Python | trains item2vec / prod2vec by treating sessions as sentences |
| PyTorch BigGraph | Python / C++ | knowledge graph and item embedding training at billions-of-nodes scale (Meta) |
| FAISS, ScaNN, HNSWlib | C++ / Python | not training, but serving: store V in an ANN index for sub-100ms retrieval |
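For example, Surprise exposes the item matrix as the qi attribute after fitting; a minimal sketch on the built-in MovieLens 100k data:

```python
from surprise import SVD, Dataset

# Built-in MovieLens 100k dataset (downloaded on first use).
data = Dataset.load_builtin("ml-100k")
trainset = data.build_full_trainset()

algo = SVD(n_factors=50, biased=True)   # matrix factorization with bias terms
algo.fit(trainset)

V = algo.qi                              # item matrix, shape (n_items, n_factors)
print(V.shape)
```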
The item matrix did not disappear with the deep learning wave; it generalized. In a modern two-tower retrieval system, the item tower replaces the static lookup row with a function of item features, but the output is still a vector that lives in the same space as the user vector and is retrieved by dot product.
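A skeletal PyTorch item tower along these lines; the feature dimensions and layer sizes are hypothetical, and the systems cited above use far richer inputs and training objectives:

```python
import torch
import torch.nn as nn

class ItemTower(nn.Module):
    """Maps item content features to a k-dimensional embedding (a row of V)."""
    def __init__(self, feature_dim=256, k=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 128),
            nn.ReLU(),
            nn.Linear(128, k),
        )

    def forward(self, item_features):
        # L2-normalize so dot products behave like cosine similarity.
        return nn.functional.normalize(self.net(item_features), dim=-1)

# Scoring is still a dot product against the user vector, exactly as in
# classical matrix factorization.
tower = ItemTower()
item_vec = tower(torch.randn(1, 256))
user_vec = torch.randn(1, 64)
score = (user_vec * item_vec).sum(dim=-1)
```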
Three trends are visible in 2024 to 2026 production systems:
- the item representation is increasingly computed by a content-aware item tower rather than stored as a free per-ID lookup table, which softens cold start;
- the serving path has standardized around ANN indexes and managed vector databases, so the item matrix lives inside the index rather than inside the model server;
- refresh cadence is moving from nightly retraining toward continuous or near-real-time updates of item embeddings.
Despite all these changes, the underlying math is the same as Funk's 2006 blog post: a low-rank latent vector for each item, scored against a user vector by dot product.
Imagine a giant grid where every row is a movie and every column is a person at the theater. Most squares are empty because most people have not seen most movies. The item matrix is a much smaller grid that gives each movie its own short list of secret numbers describing what kind of movie it is (scary, funny, long, has lots of action, and so on). Every person has their own secret list of numbers too. To guess whether a person will like a movie, you multiply their numbers by the movie's numbers and add them up. The item matrix is the part of the system that holds the movie cards.