Latent semantic analysis (LSA)
Last reviewed
Apr 30, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,997 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Apr 30, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,997 words
Add missing citations, update stale details, or suggest a clearer explanation.
Latent semantic analysis (LSA), called latent semantic indexing (LSI) in information retrieval contexts, is an unsupervised technique for analysing the relationships between a set of documents and the terms they contain. The method works by producing a low-dimensional vector representation of both terms and documents, derived from the truncated singular value decomposition of a term-document matrix. Terms that occur in similar documents and documents that share similar terms map to nearby points in the resulting low-dimensional "semantic space", which means synonyms cluster together even when they never co-occur in the same document.[^deerwester1990][^landauer1998]
LSA was developed at Bell Communications Research (Bellcore) in the late 1980s by Scott Deerwester, Susan Dumais, George Furnas, Thomas Landauer, Richard Harshman, Karen Lochbaum and Lynn Streeter. The underlying method was patented in September 1988 (US patent 4,839,853, "Computer information retrieval using latent semantic structure"), and the foundational journal paper, "Indexing by Latent Semantic Analysis", appeared in the Journal of the American Society for Information Science in 1990.[^deerwester1990][^uspto4839853] More than three decades later, LSA remains a widely taught baseline in NLP and information retrieval, a useful low-cost alternative when neural embeddings are out of reach, and a clean illustration of how linear algebra can carry semantic content.
Let a corpus contain m unique terms across n documents. Build a term-document matrix A of shape m x n whose entry A_ij is some weighted count of how often term i appears in document j. The cells are typically TF-IDF weighted, although raw counts, log-frequency weighting and entropy-based weighting all appear in the literature.[^berry1995][^manning2008] LSA then computes the truncated singular value decomposition of A:
A ~ U_k Σ_k V_k^T
where k is much smaller than the rank of A (typically a few hundred for collections with tens of thousands of documents).[^deerwester1990][^berry1995] The three factors have direct semantic interpretations:
A query q is represented as a pseudo-document, that is, as a m-dimensional bag-of-words (or bag of words) vector with the same TF-IDF weighting as the original columns of A. It is projected into the latent space by
q_hat = Σ_k^{-1} U_k^T q
and compared to existing document vectors using cosine similarity.[^manning2008][^deerwester1990] The top-k nearest documents in this low-dimensional space are returned as the search result. Document-document similarity is computed as the cosine of the corresponding columns of Σ_k V_k^T, while term-term similarity uses the rows of U_k Σ_k.
The truncation step is what gives LSA its semantic character. The full rank reproduces A exactly and therefore behaves like a literal keyword match. Truncating to k dimensions discards the smallest singular values, which carry the most idiosyncratic and noisy variation in word usage, and keeps the dominant correlations between term occurrences. A document about "cars" and a document about "automobiles" can end up close in the latent space even if they share no surface vocabulary, because both correlate with the same set of supporting terms (engine, drive, road, fuel) and these correlations dominate the leading singular vectors.[^deerwester1990][^landauer1998]
The rows of A correspond to vocabulary terms, the columns to documents. The choice of cell weighting matters more than most beginners expect. Raw counts work poorly because frequent terms dominate. The standard recipe in the original Bellcore work and in Berry, Dumais and O'Brien (1995) is to combine a local term weight (typically log(1 + tf)) with a global term weight, often inverse document frequency or one minus normalised entropy across documents, and then optionally normalise each document column to unit length.[^berry1995][^deerwester1990] Modern implementations such as scikit-learn's TruncatedSVD pipeline almost always feed the SVD a TF-IDF matrix produced by TfidfVectorizer.[^sklearn_tsvd]
For any real m x n matrix the singular value decomposition factors A as A = U Σ V^T, where U is m x r with orthonormal columns, V is n x r with orthonormal columns, Σ is r x r diagonal with non-negative entries σ_1 >= σ_2 >= ... >= σ_r >= 0, and r is the rank of A.[^manning2008] Keeping the top k singular values and the corresponding columns of U and V gives the best rank-k approximation of A in both the Frobenius norm and the spectral norm, by the Eckart-Young-Mirsky theorem. This is the precise sense in which LSA produces an optimal low-rank summary of the term-document matrix.
After truncation, term i is represented by row i of U_k and document j by column j of V_k^T. Multiplying by Σ_k to get U_k Σ_k or Σ_k V_k^T scales each latent dimension by its singular value, which is the standard form for taking cosines in the original Deerwester paper.[^deerwester1990] Geometrically, the original term-document matrix is replaced by its projection onto the k-dimensional subspace that captures the most variance, and all subsequent retrieval is done in that subspace.
A query is a pseudo-document; it is encoded as an m-vector and projected into the latent space using q_hat = Σ_k^{-1} U_k^T q. New documents that arrive after the SVD has been computed can be "folded in" the same way, although strictly speaking this distorts the orthogonality properties of the basis. For corpora that change a lot, the SVD has to be recomputed periodically or maintained with an incremental SVD algorithm such as Brand's 2003 method.[^manning2008]
The number of latent dimensions k is the only real hyperparameter in LSA. Bradford's 2008 study and Landauer's experiments suggest that around 300 dimensions usually work well for moderate-sized collections, with a useful range of roughly 100 to 500 for typical document sets and up to 1,000 for very large corpora.[^bradford2008][^landauer2008] Berry, Dumais and O'Brien quote 200 to 300 as the standard for retrieval applications.[^berry1995] Manning, Raghavan and Schutze report that LSI experiments at Bellcore in the early 1990s used Lanczos-based partial SVDs to compute decompositions of tens of thousands of documents in roughly a day on the hardware of the time, and that the resulting precision was at or above the median TREC participant for several test collections.[^manning2008]
Values of k that are too small lose discriminating power and collapse genuinely different documents onto the same vector. Values that are too large bring back the noise that the truncation was supposed to remove. There is no universally optimal choice, but most published experiments use values in the low hundreds and report only mild sensitivity within that range.
In raw bag-of-words information retrieval, two documents that mean the same thing can have nearly disjoint vocabulary. A document that says "car repair" and a document that says "automobile maintenance" share no terms, so their cosine similarity in the original term space is zero. This is the synonymy problem: many words can express the same concept, and direct keyword matching cannot tell that they do.
LSA partially solves synonymy. Because both "car" and "automobile" tend to co-occur with the same supporting terms (engine, mechanic, oil), the SVD discovers a latent dimension that loads strongly on all of them. Documents that use either word end up close in the projected space. The same logic applies to spelling variants, near-synonyms across registers, and translations within bilingual corpora, and is the basis for the cross-lingual LSI experiments first reported by Dumais and colleagues in 1996.[^landauer1998][^dumais2004]
LSA only partially solves the polysemy problem, the converse of synonymy where the same word has multiple meanings. "Bank" can mean a financial institution or the side of a river, but LSA assigns a single vector to "bank" that is essentially an average of all its contextual uses. If both senses occur in the corpus the resulting vector is a compromise that may match neither sense well. This is a fundamental limitation of any one-vector-per-word model and was a major motivation for the contextual embeddings produced by transformer encoders such as BERT, where the vector for "bank" depends on the surrounding sentence.[^landauer1998]
The move from an m-dimensional vocabulary to a k-dimensional latent space also addresses the curse of dimensionality that afflicts naive vector-space retrieval. In the raw term space, document vectors are sparse and high dimensional; cosines are dominated by which words happen to co-occur. After projection into a few hundred dense dimensions, every document has a non-trivial coordinate on every axis, and the geometry of similarity becomes much more informative.
The original target application. A query is encoded as a pseudo-document, projected into the latent space, and ranked by cosine similarity against the document vectors. The Bellcore team reported substantial precision improvements over keyword matching on the MED, CISI and CRAN test collections, and Berry, Dumais and O'Brien (1995) document larger-scale experiments on collections of up to 1 million documents using the SVDPACKC software they wrote at the University of Tennessee.[^berry1995][^deerwester1990]
Document vectors from LSA can be fed into standard clustering algorithms (k-means, hierarchical clustering) or used directly to compute pairwise similarities. Because the latent representation is dense and low dimensional, distance computations are much cheaper than in the raw term space.
Thomas Landauer's group commercialised LSA-based essay scoring through Knowledge Analysis Technologies, which Pearson Education acquired in 2004 to form Pearson Knowledge Technologies. Their Intelligent Essay Assessor (IEA) trains an LSA semantic space on a large reference corpus and then scores student essays by their cosine similarity to a small set of human-graded reference essays. Pearson reports that IEA needs roughly 100 pre-scored essays per prompt to train, compared with 300 to 500 essays per prompt for older systems, and that its scores agree with human raters about as well as human raters agree with each other.[^pearson_iea][^foltz1999] IEA powered Pearson's WriteToLearn educational product and its large-scale automated scoring services for state and national exams.
The term-document matrix is structurally identical to a user-item matrix in recommender systems, which made LSA an early prototype for matrix factorization approaches to collaborative filtering. Latent factor models that became standard for movie and product recommendation, including the SVD-based methods that won the Netflix Prize in 2009, descend directly from this idea.
The rows of U_k are dense vector representations of terms with the property that semantically similar terms have similar vectors. They are, in effect, an early form of word embedding, produced by SVD on co-occurrence statistics rather than by neural training. This is more than a retrospective analogy: Levy and Goldberg's 2014 paper Neural Word Embedding as Implicit Matrix Factorization showed that word2vec's skip-gram with negative sampling is implicitly factorising a shifted pointwise mutual information matrix, putting it in the same mathematical family as LSA on a different cell weighting.[^levy2014]
LSA sits at the head of a long line of techniques for building low-dimensional representations of documents and words. The table below summarises the main relatives.
| Method | Year | Type | Mathematical core | Notes |
|---|---|---|---|---|
| LSA / LSI | 1988-1990 | Linear, deterministic | Truncated SVD on TF-IDF term-document matrix | Original method, Bellcore patent and Deerwester et al. 1990[^deerwester1990] |
| pLSA | 1999 | Probabilistic mixture | Multinomial mixture model trained by EM on word-document counts | Hofmann's UAI/SIGIR 1999 papers; addresses LSA's lack of a generative semantics[^hofmann1999] |
| LDA | 2003 | Bayesian generative | Three-level hierarchical Bayesian topic model with Dirichlet priors | Blei, Ng and Jordan 2003 in JMLR; the standard topic model[^blei2003] |
| NMF | 1999 | Linear, deterministic | Non-negative factorisation A ~ WH with W,H >= 0 | Lee and Seung 1999 in Nature; produces parts-based, interpretable factors[^leeseung1999] |
| word2vec | 2013 | Neural, predictive | Shallow neural network predicting context words (CBOW or skip-gram) | Mikolov et al. 2013; equivalent to factorising a shifted PMI matrix[^mikolov2013][^levy2014] |
| GloVe | 2014 | Linear, count-based | Weighted least-squares factorisation of log word-word co-occurrence matrix | Pennington et al. 2014; explicitly hybridises count and predict approaches[^pennington2014] |
| Sentence-BERT | 2019 | Contextual neural | Siamese BERT fine-tuned on sentence-pair tasks | Reimers and Gurevych 2019; one vector per sentence with full transformer context[^reimers2019] |
| Modern dense retrievers (DPR, ColBERT, E5, etc.) | 2020-present | Contextual neural | Dual-encoder transformers trained on (query, relevant document) pairs | State of the art for retrieval, but require GPUs and large training corpora |
The broad direction of travel is from one-shot linear factorisations of count statistics (LSA, NMF) toward probabilistic generative models (pLSA, LDA), and then to neural embeddings that condition on full context (Sentence-BERT, modern dense retrievers). LSA is the simplest and oldest member of this family, and several of the others reduce to LSA in special limits.
LSA persists as a teaching staple and a practical baseline because it is genuinely good at certain things.
LSA is also clearly a 1990s technique with limitations that motivated everything that came after it.
For anything beyond toy corpora, the cost of LSA is dominated by the SVD step. Three families of algorithms are used in practice.
Lanczos-based partial SVD. ARPACK and its successors compute the top k singular triples of a sparse matrix by Lanczos bidiagonalisation, requiring only matrix-vector products. This is the route used in MATLAB's svds and SciPy's scipy.sparse.linalg.svds. It was the standard at Bellcore in the 1990s; Manning, Raghavan and Schutze quote experiments where LSI on tens of thousands of documents took roughly a day on a single machine using this approach.[^manning2008]
Randomised SVD. Halko, Martinsson and Tropp's 2011 SIAM Review paper Finding Structure with Randomness introduced a modular framework that approximates the top k singular vectors using a small number of random projections. The method requires only 2(q+1) passes over the matrix and is markedly faster than Lanczos for large matrices with rapidly decaying singular spectra.[^halko2011] It is the default solver in scikit-learn's TruncatedSVD.
Incremental SVD. Brand's 2003 algorithm and its descendants update an existing SVD when rows or columns are added or removed, without recomputing from scratch. Gensim's LsiModel ships an online version that can stream very large corpora through bounded memory, suitable for collections that cannot fit in RAM.[^gensim_lsi]
| Library | Function or class | Notes |
|---|---|---|
| scikit-learn (Python) | sklearn.decomposition.TruncatedSVD | Default LSA implementation; supports ARPACK and randomised SVD solvers; recommended n_components=100 for LSA[^sklearn_tsvd] |
| Gensim (Python) | gensim.models.LsiModel | Streaming, online, distributed-friendly; built for corpora that exceed RAM[^gensim_lsi] |
| MATLAB | svds, svd | svds is the standard partial-SVD routine for sparse matrices |
| R | irlba package | Implicitly Restarted Lanczos Bidiagonalisation; the standard partial-SVD package in R |
| SVDPACKC | C library | The original Bellcore-Tennessee software used in Berry, Dumais, O'Brien 1995[^berry1995] |
| Apache Spark | RowMatrix.computeSVD | Distributed truncated SVD for very large matrices |
In addition to its founding role in concept-based information retrieval, LSA found significant use in education through Pearson Knowledge Technologies' Intelligent Essay Assessor and the WriteToLearn product line, which used LSA-based scoring for state and national K-12 assessments.[^pearson_iea] The method was also adopted in library catalog systems, intelligence community text analysis (the SAIC Latent Semantic Indexing Engine was first deployed in 1999), patent prior art retrieval, and electronic discovery software for litigation support.[^dumais2004]
In machine learning research the influence of LSA shows up indirectly. Latent factor models for collaborative filtering, the Netflix Prize winners, and modern recommender systems all rest on essentially the same SVD-on-a-sparse-matrix idea. Levy and Goldberg's 2014 result that word2vec's skip-gram with negative sampling implicitly factorises a shifted PMI matrix made the kinship explicit: word2vec is, in a precise sense, LSA with a different cell weighting and a different optimisation procedure.[^levy2014] GloVe (Pennington et al. 2014) makes the connection more explicit still by directly factorising a weighted log-co-occurrence matrix.[^pennington2014]
For state-of-the-art retrieval and topic modelling, LSA has been superseded by neural methods. Sentence-BERT and dense retrievers built on transformer encoders dominate retrieval benchmarks; LDA and its neural-variational descendants dominate topic modelling.[^reimers2019][^blei2003] Most contemporary NLP papers report LSA only as a baseline.
That does not make LSA obsolete. It is still the natural first thing to try when faced with a new text collection, particularly when computational budgets, training data or expertise rule out a transformer pipeline. It runs on a laptop, requires no labelled data, and gives interpretable, reproducible results. The mathematics it relies on (TF-IDF weighting, sparse matrices, the singular value decomposition, cosine similarity) are foundational pieces of applied linear algebra that turn up everywhere in machine learning, so understanding LSA is understanding several of the core building blocks of the field. As a teaching example it is unusually clean: the entire algorithm fits in a few lines of NumPy, the geometric intuition is visible, and the failure modes (polysemy, bag-of-words ignorance of order) directly motivate the historical sequence of methods that followed.
Bag of words, cosine similarity, embeddings, information retrieval, matrix factorization, NLP, singular value decomposition, TF-IDF, word embedding.