Latent semantic analysis (LSA)
Last reviewed
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 4,397 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 4,397 words
Add missing citations, update stale details, or suggest a clearer explanation.
Latent semantic analysis (LSA), called latent semantic indexing (LSI) in information retrieval contexts, is an unsupervised technique that maps both terms and documents into a shared low-dimensional vector space by taking the truncated singular value decomposition of a term-document matrix. Because the decomposition keeps only the dominant correlations in word usage, terms that mean the same thing (for example "car" and "automobile") land near each other in the reduced space even when they never appear in the same document, which lets retrieval match by concept rather than by exact keyword. LSA was introduced by a team at Bell Communications Research (Bellcore) in a 1988 patent and the 1990 paper "Indexing by Latent Semantic Analysis", and it is widely regarded as a direct precursor to modern word and document embeddings.123
The founding paper described the method plainly: it "takes advantage of implicit higher-order structure in the association of terms with documents ('semantic structure')" using "singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination."14 That 1990 paper is among the most cited works in information retrieval; the indexing service SciSpace records more than 13,000 citations for it.5
LSA was developed in the late 1980s by Scott Deerwester, Susan Dumais, George Furnas, Thomas Landauer, Richard Harshman, Karen Lochbaum and Lynn Streeter. The underlying method was filed as a US patent on 15 September 1988 (US patent 4,839,853, "Computer information retrieval using latent semantic structure", granted 13 June 1989), and the foundational journal paper appeared in the Journal of the American Society for Information Science in 1990, in volume 41, issue 6, pages 391-407.13 More than three decades later, LSA remains a widely taught baseline in natural language processing and information retrieval, a useful low-cost alternative when neural embeddings are out of reach, and a clean illustration of how linear algebra can carry semantic content.
Let a corpus contain m unique terms across n documents. Build a term-document matrix A of shape m x n whose entry A_ij is some weighted count of how often term i appears in document j. The cells are typically TF-IDF weighted, although raw counts, log-frequency weighting and entropy-based weighting all appear in the literature.67 LSA then computes the truncated singular value decomposition of A:
A ~ U_k Σ_k V_k^T
where k is much smaller than the rank of A (typically a few hundred for collections with tens of thousands of documents; the original Bellcore work used about 100 factors).16 The three factors have direct semantic interpretations:
A query q is represented as a pseudo-document, that is, as a m-dimensional bag-of-words (or bag of words) vector with the same TF-IDF weighting as the original columns of A. It is projected into the latent space by
q_hat = Σ_k^{-1} U_k^T q
and compared to existing document vectors using cosine similarity.71 The top-k nearest documents in this low-dimensional space are returned as the search result. Document-document similarity is computed as the cosine of the corresponding columns of Σ_k V_k^T, while term-term similarity uses the rows of U_k Σ_k.
The truncation step is what gives LSA its semantic character. The full rank reproduces A exactly and therefore behaves like a literal keyword match. Truncating to k dimensions discards the smallest singular values, which carry the most idiosyncratic and noisy variation in word usage, and keeps the dominant correlations between term occurrences. A document about "cars" and a document about "automobiles" can end up close in the latent space even if they share no surface vocabulary, because both correlate with the same set of supporting terms (engine, drive, road, fuel) and these correlations dominate the leading singular vectors.12
The psychologist Thomas Landauer, one of the inventors, framed LSA as more than an engineering trick. In the widely read 1998 introduction, he and his coauthors defined it as "a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text."2
The rows of A correspond to vocabulary terms, the columns to documents. The choice of cell weighting matters more than most beginners expect. Raw counts work poorly because frequent terms dominate. The standard recipe in the original Bellcore work and in Berry, Dumais and O'Brien (1995) is to combine a local term weight (typically log(1 + tf)) with a global term weight, often inverse document frequency or one minus normalised entropy across documents, and then optionally normalise each document column to unit length.61 Modern implementations such as scikit-learn's TruncatedSVD pipeline almost always feed the SVD a TF-IDF matrix produced by TfidfVectorizer.8
For any real m x n matrix the singular value decomposition factors A as A = U Σ V^T, where U is m x r with orthonormal columns, V is n x r with orthonormal columns, Σ is r x r diagonal with non-negative entries σ_1 >= σ_2 >= ... >= σ_r >= 0, and r is the rank of A.7 Keeping the top k singular values and the corresponding columns of U and V gives the best rank-k approximation of A in both the Frobenius norm and the spectral norm, by the Eckart-Young-Mirsky theorem. This is the precise sense in which LSA produces an optimal low-rank summary of the term-document matrix.
After truncation, term i is represented by row i of U_k and document j by column j of V_k^T. Multiplying by Σ_k to get U_k Σ_k or Σ_k V_k^T scales each latent dimension by its singular value, which is the standard form for taking cosines in the original Deerwester paper.1 Geometrically, the original term-document matrix is replaced by its projection onto the k-dimensional subspace that captures the most variance, and all subsequent retrieval is done in that subspace.
A query is a pseudo-document; it is encoded as an m-vector and projected into the latent space using q_hat = Σ_k^{-1} U_k^T q. New documents that arrive after the SVD has been computed can be "folded in" the same way, although strictly speaking this distorts the orthogonality properties of the basis. For corpora that change a lot, the SVD has to be recomputed periodically or maintained with an incremental SVD algorithm such as Brand's 2003 method.7
The number of latent dimensions k is the only real hyperparameter in LSA. The original Bellcore paper worked with roughly 100 factors.1 Bradford's 2008 study and Landauer's experiments suggest that around 300 dimensions usually work well for moderate-sized collections, with a useful range of roughly 100 to 500 for typical document sets and up to 1,000 for very large corpora.910 Berry, Dumais and O'Brien quote 200 to 300 as the standard for retrieval applications.6 Manning, Raghavan and Schutze report that LSI experiments at Bellcore in the early 1990s used Lanczos-based partial SVDs to compute decompositions of tens of thousands of documents in roughly a day on the hardware of the time, and that the resulting precision was at or above the median TREC participant for several test collections.7
Values of k that are too small lose discriminating power and collapse genuinely different documents onto the same vector. Values that are too large bring back the noise that the truncation was supposed to remove. There is no universally optimal choice, but most published experiments use values in the low hundreds and report only mild sensitivity within that range.
In raw bag-of-words information retrieval, two documents that mean the same thing can have nearly disjoint vocabulary. A document that says "car repair" and a document that says "automobile maintenance" share no terms, so their cosine similarity in the original term space is zero. This is the synonymy problem: many words can express the same concept, and direct keyword matching cannot tell that they do.
LSA partially solves synonymy. Because both "car" and "automobile" tend to co-occur with the same supporting terms (engine, mechanic, oil), the SVD discovers a latent dimension that loads strongly on all of them. Documents that use either word end up close in the projected space. The same logic applies to spelling variants, near-synonyms across registers, and translations within bilingual corpora, and is the basis for the cross-lingual LSI experiments first reported by Dumais and colleagues in 1996.211
LSA only partially solves the polysemy problem, the converse of synonymy where the same word has multiple meanings. "Bank" can mean a financial institution or the side of a river, but LSA assigns a single vector to "bank" that is essentially an average of all its contextual uses. If both senses occur in the corpus the resulting vector is a compromise that may match neither sense well. This is a fundamental limitation of any one-vector-per-word model and was a major motivation for the contextual embeddings produced by transformer encoders such as BERT, where the vector for "bank" depends on the surrounding sentence.2
The move from an m-dimensional vocabulary to a k-dimensional latent space also addresses the curse of dimensionality that afflicts naive vector-space retrieval. In the raw term space, document vectors are sparse and high dimensional; cosines are dominated by which words happen to co-occur. After projection into a few hundred dense dimensions, every document has a non-trivial coordinate on every axis, and the geometry of similarity becomes much more informative.
The original target application. A query is encoded as a pseudo-document, projected into the latent space, and ranked by cosine similarity against the document vectors. The Bellcore team reported substantial precision improvements over keyword matching on the MED, CISI and CRAN test collections, and Berry, Dumais and O'Brien (1995) document larger-scale experiments on collections of up to 1 million documents using the SVDPACKC software they wrote at the University of Tennessee.61
Document vectors from LSA can be fed into standard clustering algorithms (k-means, hierarchical clustering) or used directly to compute pairwise similarities. Because the latent representation is dense and low dimensional, distance computations are much cheaper than in the raw term space.
Thomas Landauer's group commercialised LSA-based essay scoring through Knowledge Analysis Technologies, which Pearson Education acquired in 2004 to form Pearson Knowledge Technologies. Their Intelligent Essay Assessor (IEA) trains an LSA semantic space on a large reference corpus and then scores student essays by their cosine similarity to a small set of human-graded reference essays. Pearson reports that IEA needs roughly 100 pre-scored essays per prompt to train, compared with 300 to 500 essays per prompt for older systems, and that its scores agree with human raters about as well as human raters agree with each other.1213 IEA powered Pearson's WriteToLearn educational product and its large-scale automated scoring services for state and national exams.
The term-document matrix is structurally identical to a user-item matrix in recommender systems, which made LSA an early prototype for matrix factorization approaches to collaborative filtering. Latent factor models that became standard for movie and product recommendation, including the SVD-based methods that won the Netflix Prize in 2009, descend directly from this idea.
The rows of U_k are dense vector representations of terms with the property that semantically similar terms have similar vectors. They are, in effect, an early form of word embedding, produced by SVD on co-occurrence statistics rather than by neural training. This is more than a retrospective analogy: Levy and Goldberg's 2014 paper Neural Word Embedding as Implicit Matrix Factorization showed that word2vec's skip-gram with negative sampling is implicitly factorising a shifted pointwise mutual information matrix, putting it in the same mathematical family as LSA on a different cell weighting.14
LSA sits at the head of a long line of techniques for building low-dimensional representations of documents and words. The two most-cited probabilistic successors, probabilistic LSA (pLSA) and latent Dirichlet allocation (LDA), were motivated explicitly by limitations of LSA's purely linear-algebraic formulation. pLSA replaces the SVD with a latent-class statistical model trained by expectation maximisation, while LDA adds a full Bayesian generative story with Dirichlet priors over topics. The table below summarises the main relatives.
| Method | Year | Type | Mathematical core | Notes |
|---|---|---|---|---|
| LSA / LSI | 1988-1990 | Linear, deterministic | Truncated SVD on TF-IDF term-document matrix | Original method, Bellcore patent and Deerwester et al. 19901 |
| pLSA | 1999 | Probabilistic mixture | Multinomial mixture model trained by EM on word-document counts | Hofmann's UAI/SIGIR 1999 papers; addresses LSA's lack of a generative semantics16 |
| LDA | 2003 | Bayesian generative | Three-level hierarchical Bayesian topic model with Dirichlet priors | Blei, Ng and Jordan 2003 in JMLR; the standard topic model17 |
| NMF | 1999 | Linear, deterministic | Non-negative factorisation A ~ WH with W,H >= 0 | Lee and Seung 1999 in Nature; produces parts-based, interpretable factors18 |
| word2vec | 2013 | Neural, predictive | Shallow neural network predicting context words (CBOW or skip-gram) | Mikolov et al. 2013; equivalent to factorising a shifted PMI matrix1914 |
| GloVe | 2014 | Linear, count-based | Weighted least-squares factorisation of log word-word co-occurrence matrix | Pennington et al. 2014; explicitly hybridises count and predict approaches20 |
| Sentence-BERT | 2019 | Contextual neural | Siamese BERT fine-tuned on sentence-pair tasks | Reimers and Gurevych 2019; one vector per sentence with full transformer context21 |
| Modern dense retrievers (DPR, ColBERT, E5, etc.) | 2020-present | Contextual neural | Dual-encoder transformers trained on (query, relevant document) pairs | State of the art for retrieval, but require GPUs and large training corpora |
The central contrast is statistical versus algebraic. LSA is model-free: it produces a low-dimensional representation by linear algebra alone, with orthogonal latent axes and no probabilistic interpretation. Hofmann's pLSA (1999) instead treats topics as multinomial word distributions, allows them to be non-orthogonal, and fits them by maximum likelihood, which makes the resulting factors directly interpretable as probabilities. Blei, Ng and Jordan's LDA (2003) goes further, modelling the entire data-generating process with Dirichlet priors so that the topic mixtures generalise to unseen documents.1617 The broad direction of travel is from one-shot linear factorisations of count statistics (LSA, NMF) toward probabilistic generative models (pLSA, LDA), and then to neural embeddings that condition on full context (Sentence-BERT, modern dense retrievers). LSA is the simplest and oldest member of this family, and several of the others reduce to LSA in special limits.
LSA persists as a teaching staple and a practical baseline because it is genuinely good at certain things.
LSA is also clearly a 1990s technique with limitations that motivated everything that came after it.
For anything beyond toy corpora, the cost of LSA is dominated by the SVD step. Three families of algorithms are used in practice.
Lanczos-based partial SVD. ARPACK and its successors compute the top k singular triples of a sparse matrix by Lanczos bidiagonalisation, requiring only matrix-vector products. This is the route used in MATLAB's svds and SciPy's scipy.sparse.linalg.svds. It was the standard at Bellcore in the 1990s; Manning, Raghavan and Schutze quote experiments where LSI on tens of thousands of documents took roughly a day on a single machine using this approach.7
Randomised SVD. Halko, Martinsson and Tropp's 2011 SIAM Review paper Finding Structure with Randomness introduced a modular framework that approximates the top k singular vectors using a small number of random projections. The method requires only 2(q+1) passes over the matrix and is markedly faster than Lanczos for large matrices with rapidly decaying singular spectra.23 It is the default solver in scikit-learn's TruncatedSVD.
Incremental SVD. Brand's 2003 algorithm and its descendants update an existing SVD when rows or columns are added or removed, without recomputing from scratch. Gensim's LsiModel ships an online version that can stream very large corpora through bounded memory, suitable for collections that cannot fit in RAM.24
| Library | Function or class | Notes |
|---|---|---|
| scikit-learn (Python) | sklearn.decomposition.TruncatedSVD | Default LSA implementation; supports ARPACK and randomised SVD solvers; recommended n_components=100 for LSA8 |
| Gensim (Python) | gensim.models.LsiModel | Streaming, online, distributed-friendly; built for corpora that exceed RAM24 |
| MATLAB | svds, svd | svds is the standard partial-SVD routine for sparse matrices |
| R | irlba package | Implicitly Restarted Lanczos Bidiagonalisation; the standard partial-SVD package in R |
| SVDPACKC | C library | The original Bellcore-Tennessee software used in Berry, Dumais, O'Brien 19956 |
| Apache Spark | RowMatrix.computeSVD | Distributed truncated SVD for very large matrices |
In addition to its founding role in concept-based information retrieval, LSA found significant use in education through Pearson Knowledge Technologies' Intelligent Essay Assessor and the WriteToLearn product line, which used LSA-based scoring for state and national K-12 assessments.12 The method was also adopted in library catalog systems, intelligence community text analysis (the SAIC Latent Semantic Indexing Engine was first deployed in 1999), patent prior art retrieval, and electronic discovery software for litigation support.11
In machine learning research the influence of LSA shows up indirectly. Latent factor models for collaborative filtering, the Netflix Prize winners, and modern recommender systems all rest on essentially the same SVD-on-a-sparse-matrix idea. Levy and Goldberg's 2014 result that word2vec's skip-gram with negative sampling implicitly factorises a shifted PMI matrix made the kinship explicit: word2vec is, in a precise sense, LSA with a different cell weighting and a different optimisation procedure.14 GloVe (Pennington et al. 2014) makes the connection more explicit still by directly factorising a weighted log-co-occurrence matrix.20
For state-of-the-art retrieval and topic modelling, LSA has been superseded by neural methods. Sentence-BERT and dense retrievers built on transformer encoders dominate retrieval benchmarks; LDA and its neural-variational descendants dominate topic modelling.2117 Most contemporary NLP papers report LSA only as a baseline.
That does not make LSA obsolete. It is still the natural first thing to try when faced with a new text collection, particularly when computational budgets, training data or expertise rule out a transformer pipeline. It runs on a laptop, requires no labelled data, and gives interpretable, reproducible results. The mathematics it relies on (TF-IDF weighting, sparse matrices, the singular value decomposition, cosine similarity) are foundational pieces of applied linear algebra that turn up everywhere in machine learning, so understanding LSA is understanding several of the core building blocks of the field. As a teaching example it is unusually clean: the entire algorithm fits in a few lines of NumPy, the geometric intuition is visible, and the failure modes (polysemy, bag-of-words ignorance of order) directly motivate the historical sequence of methods that followed.
Bag of words, cosine similarity, embeddings, information retrieval, matrix factorization, natural language processing, singular value decomposition, TF-IDF, word embedding.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6), 391-407. DOI link. PDF: wordvec.colorado.edu/papers/Deerwester_1990.pdf. ↩ ↩2 ↩3 ↩4 ↩5 ↩6 ↩7 ↩8 ↩9 ↩10 ↩11
Landauer, T. K., Foltz, P. W., and Laham, D. (1998). An Introduction to Latent Semantic Analysis. Discourse Processes, 25(2-3), 259-284. tandfonline.com/doi/abs/10.1080/01638539809545028. PDF: wordvec.colorado.edu/papers/Landauer_Foltz_Laham_1998.pdf. ↩ ↩2 ↩3 ↩4 ↩5
Deerwester, S., Dumais, S. T., Furnas, G. W., Harshman, R., Landauer, T. K., Lochbaum, K., and Streeter, L. Computer information retrieval using latent semantic structure. US Patent 4,839,853, filed 15 September 1988, granted 13 June 1989. patents.google.com/patent/US4839853A. ↩ ↩2
ERIC record EJ415308, abstract of Deerwester et al. (1990), "Indexing by Latent Semantic Analysis", Journal of the American Society for Information Science, v41 n6 p391-407. eric.ed.gov/?id=EJ415308. ↩
SciSpace (formerly Typeset) citation record for Deerwester et al. (1990), "Indexing by Latent Semantic Analysis" (13,000+ citations). scispace.com/papers/indexing-by-latent-semantic-analysis-4kzjkh4lw5. ↩
Berry, M. W., Dumais, S. T., and O'Brien, G. W. (1995). Using Linear Algebra for Intelligent Information Retrieval. SIAM Review, 37(4), 573-595. epubs.siam.org/doi/10.1137/1037127. ↩ ↩2 ↩3 ↩4 ↩5 ↩6
Manning, C. D., Raghavan, P., and Schutze, H. (2008). Introduction to Information Retrieval, Chapter 18: "Matrix decompositions and latent semantic indexing". Cambridge University Press. nlp.stanford.edu/IR-book/pdf/18lsi.pdf. ↩ ↩2 ↩3 ↩4 ↩5 ↩6 ↩7
scikit-learn developers. sklearn.decomposition.TruncatedSVD documentation. scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html. ↩ ↩2
Bradford, R. B. (2008). An Empirical Study of Required Dimensionality for Large-Scale Latent Semantic Indexing Applications. In CIKM 2008, 153-162. ↩
Landauer, T. K., McNamara, D. S., Dennis, S., and Kintsch, W. (eds.) (2007). Handbook of Latent Semantic Analysis. Lawrence Erlbaum Associates. ↩
Dumais, S. T. (2004). Latent semantic analysis. Annual Review of Information Science and Technology, 38(1), 188-230. asistdl.onlinelibrary.wiley.com/doi/abs/10.1002/aris.1440380105. ↩ ↩2 ↩3 ↩4
Pearson Knowledge Technologies. WriteToLearn / Intelligent Essay Assessor product overview. pearsonassessments.com/large-scale-assessments/k-12-large-scale-assessments/automated-scoring.html. ↩ ↩2
Foltz, P. W., Laham, D., and Landauer, T. K. (1999). The Intelligent Essay Assessor: Applications to Educational Technology. Interactive Multimedia Electronic Journal of Computer-Enhanced Learning, 1(2). imej.wfu.edu/articles/1999/2/04/index.asp. ↩
Levy, O., and Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization. In NeurIPS 2014. papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization. ↩ ↩2 ↩3
Howard, M. W., and Kahana, M. J. (1999). Contextual Variability and Serial Position Effects in Free Recall. Journal of Experimental Psychology: Learning, Memory, and Cognition, 25(4), 923-941. ↩
Hofmann, T. (1999). Probabilistic Latent Semantic Analysis. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI), 289-296. Also: Hofmann, T. (1999). Probabilistic Latent Semantic Indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference, 50-57. sigir.org/wp-content/uploads/2017/06/p211.pdf. ↩ ↩2 ↩3
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022. jmlr.csail.mit.edu/papers/v3/blei03a.html. ↩ ↩2 ↩3 ↩4
Lee, D. D., and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788-791. nature.com/articles/44565. ↩
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781. arxiv.org/abs/1301.3781. ↩
Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. In EMNLP 2014, 1532-1543. nlp.stanford.edu/pubs/glove.pdf. ↩ ↩2
Reimers, N., and Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In EMNLP-IJCNLP 2019, 3982-3992. arxiv.org/abs/1908.10084. ↩ ↩2
Altszyler, E., Sigman, M., Ribeiro, S., and Slezak, D. F. (2017). Comparative study of LSA vs Word2vec embeddings in small corpora: a case study in dreams database. arXiv:1610.01520. arxiv.org/abs/1610.01520. ↩
Halko, N., Martinsson, P. G., and Tropp, J. A. (2011). Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review, 53(2), 217-288. epubs.siam.org/doi/10.1137/090771806. arxiv.org/abs/0909.4061. ↩
Rehurek, R. Gensim: models.lsimodel - Latent Semantic Indexing documentation. radimrehurek.com/gensim/models/lsimodel.html. ↩ ↩2