Latent semantic analysis (LSA)

Latent semantic analysis (LSA), called latent semantic indexing (LSI) in information retrieval contexts, is an unsupervised technique for analysing the relationships between a set of documents and the terms they contain. The method works by producing a low-dimensional vector representation of both terms and documents, derived from the truncated singular value decomposition of a term-document matrix. Terms that occur in similar documents and documents that share similar terms map to nearby points in the resulting low-dimensional "semantic space", which means synonyms cluster together even when they never co-occur in the same document.[^deerwester1990][^landauer1998]

LSA was developed at Bell Communications Research (Bellcore) in the late 1980s by Scott Deerwester, Susan Dumais, George Furnas, Thomas Landauer, Richard Harshman, Karen Lochbaum and Lynn Streeter. The underlying method was patented in September 1988 (US patent 4,839,853, "Computer information retrieval using latent semantic structure"), and the foundational journal paper, "Indexing by Latent Semantic Analysis", appeared in the Journal of the American Society for Information Science in 1990.[^deerwester1990][^uspto4839853] More than three decades later, LSA remains a widely taught baseline in NLP and information retrieval, a useful low-cost alternative when neural embeddings are out of reach, and a clean illustration of how linear algebra can carry semantic content.

Definition

Let a corpus contain m unique terms across n documents. Build a term-document matrix A of shape m x n whose entry A_ij is some weighted count of how often term i appears in document j. The cells are typically TF-IDF weighted, although raw counts, log-frequency weighting and entropy-based weighting all appear in the literature.[^berry1995][^manning2008] LSA then computes the truncated singular value decomposition of A:

A  ~  U_k Σ_k V_k^T

where k is much smaller than the rank of A (typically a few hundred for collections with tens of thousands of documents).[^deerwester1990][^berry1995] The three factors have direct semantic interpretations:

U_k is m x k. Its rows are k-dimensional term vectors in the latent semantic space.
Σ_k is k x k and diagonal. Its entries are the k largest singular values of A, each measuring how much variance the corresponding latent dimension captures.
V_k^T is k x n. Its columns are k-dimensional document vectors in the same space.

A query q is represented as a pseudo-document, that is, as a m-dimensional bag-of-words (or bag of words) vector with the same TF-IDF weighting as the original columns of A. It is projected into the latent space by

q_hat  =  Σ_k^{-1} U_k^T q

and compared to existing document vectors using cosine similarity.[^manning2008][^deerwester1990] The top-k nearest documents in this low-dimensional space are returned as the search result. Document-document similarity is computed as the cosine of the corresponding columns of Σ_k V_k^T, while term-term similarity uses the rows of U_k Σ_k.

The truncation step is what gives LSA its semantic character. The full rank reproduces A exactly and therefore behaves like a literal keyword match. Truncating to k dimensions discards the smallest singular values, which carry the most idiosyncratic and noisy variation in word usage, and keeps the dominant correlations between term occurrences. A document about "cars" and a document about "automobiles" can end up close in the latent space even if they share no surface vocabulary, because both correlate with the same set of supporting terms (engine, drive, road, fuel) and these correlations dominate the leading singular vectors.[^deerwester1990][^landauer1998]

Mathematical construction step by step

Step 1: Build the term-document matrix

The rows of A correspond to vocabulary terms, the columns to documents. The choice of cell weighting matters more than most beginners expect. Raw counts work poorly because frequent terms dominate. The standard recipe in the original Bellcore work and in Berry, Dumais and O'Brien (1995) is to combine a local term weight (typically log(1 + tf)) with a global term weight, often inverse document frequency or one minus normalised entropy across documents, and then optionally normalise each document column to unit length.[^berry1995][^deerwester1990] Modern implementations such as scikit-learn's TruncatedSVD pipeline almost always feed the SVD a TF-IDF matrix produced by TfidfVectorizer.[^sklearn_tsvd]

Step 2: Compute the truncated SVD

For any real m x n matrix the singular value decomposition factors A as A = U Σ V^T, where U is m x r with orthonormal columns, V is n x r with orthonormal columns, Σ is r x r diagonal with non-negative entries σ_1 >= σ_2 >= ... >= σ_r >= 0, and r is the rank of A.[^manning2008] Keeping the top k singular values and the corresponding columns of U and V gives the best rank-k approximation of A in both the Frobenius norm and the spectral norm, by the Eckart-Young-Mirsky theorem. This is the precise sense in which LSA produces an optimal low-rank summary of the term-document matrix.

Step 3: Embed terms and documents

After truncation, term i is represented by row i of U_k and document j by column j of V_k^T. Multiplying by Σ_k to get U_k Σ_k or Σ_k V_k^T scales each latent dimension by its singular value, which is the standard form for taking cosines in the original Deerwester paper.[^deerwester1990] Geometrically, the original term-document matrix is replaced by its projection onto the k-dimensional subspace that captures the most variance, and all subsequent retrieval is done in that subspace.

Step 4: Fold in queries and new documents

A query is a pseudo-document; it is encoded as an m-vector and projected into the latent space using q_hat = Σ_k^{-1} U_k^T q. New documents that arrive after the SVD has been computed can be "folded in" the same way, although strictly speaking this distorts the orthogonality properties of the basis. For corpora that change a lot, the SVD has to be recomputed periodically or maintained with an incremental SVD algorithm such as Brand's 2003 method.[^manning2008]

Choosing k

The number of latent dimensions k is the only real hyperparameter in LSA. Bradford's 2008 study and Landauer's experiments suggest that around 300 dimensions usually work well for moderate-sized collections, with a useful range of roughly 100 to 500 for typical document sets and up to 1,000 for very large corpora.[^bradford2008][^landauer2008] Berry, Dumais and O'Brien quote 200 to 300 as the standard for retrieval applications.[^berry1995] Manning, Raghavan and Schutze report that LSI experiments at Bellcore in the early 1990s used Lanczos-based partial SVDs to compute decompositions of tens of thousands of documents in roughly a day on the hardware of the time, and that the resulting precision was at or above the median TREC participant for several test collections.[^manning2008]

Values of k that are too small lose discriminating power and collapse genuinely different documents onto the same vector. Values that are too large bring back the noise that the truncation was supposed to remove. There is no universally optimal choice, but most published experiments use values in the low hundreds and report only mild sensitivity within that range.

Why it works: synonymy, polysemy and the curse of dimensionality

In raw bag-of-words information retrieval, two documents that mean the same thing can have nearly disjoint vocabulary. A document that says "car repair" and a document that says "automobile maintenance" share no terms, so their cosine similarity in the original term space is zero. This is the synonymy problem: many words can express the same concept, and direct keyword matching cannot tell that they do.

LSA partially solves synonymy. Because both "car" and "automobile" tend to co-occur with the same supporting terms (engine, mechanic, oil), the SVD discovers a latent dimension that loads strongly on all of them. Documents that use either word end up close in the projected space. The same logic applies to spelling variants, near-synonyms across registers, and translations within bilingual corpora, and is the basis for the cross-lingual LSI experiments first reported by Dumais and colleagues in 1996.[^landauer1998][^dumais2004]

LSA only partially solves the polysemy problem, the converse of synonymy where the same word has multiple meanings. "Bank" can mean a financial institution or the side of a river, but LSA assigns a single vector to "bank" that is essentially an average of all its contextual uses. If both senses occur in the corpus the resulting vector is a compromise that may match neither sense well. This is a fundamental limitation of any one-vector-per-word model and was a major motivation for the contextual embeddings produced by transformer encoders such as BERT, where the vector for "bank" depends on the surrounding sentence.[^landauer1998]

The move from an m-dimensional vocabulary to a k-dimensional latent space also addresses the curse of dimensionality that afflicts naive vector-space retrieval. In the raw term space, document vectors are sparse and high dimensional; cosines are dominated by which words happen to co-occur. After projection into a few hundred dense dimensions, every document has a non-trivial coordinate on every axis, and the geometry of similarity becomes much more informative.

Use cases

Document retrieval (LSI)

The original target application. A query is encoded as a pseudo-document, projected into the latent space, and ranked by cosine similarity against the document vectors. The Bellcore team reported substantial precision improvements over keyword matching on the MED, CISI and CRAN test collections, and Berry, Dumais and O'Brien (1995) document larger-scale experiments on collections of up to 1 million documents using the SVDPACKC software they wrote at the University of Tennessee.[^berry1995][^deerwester1990]

Document similarity and clustering

Document vectors from LSA can be fed into standard clustering algorithms (k-means, hierarchical clustering) or used directly to compute pairwise similarities. Because the latent representation is dense and low dimensional, distance computations are much cheaper than in the raw term space.

Automated essay scoring

Thomas Landauer's group commercialised LSA-based essay scoring through Knowledge Analysis Technologies, which Pearson Education acquired in 2004 to form Pearson Knowledge Technologies. Their Intelligent Essay Assessor (IEA) trains an LSA semantic space on a large reference corpus and then scores student essays by their cosine similarity to a small set of human-graded reference essays. Pearson reports that IEA needs roughly 100 pre-scored essays per prompt to train, compared with 300 to 500 essays per prompt for older systems, and that its scores agree with human raters about as well as human raters agree with each other.[^pearson_iea][^foltz1999] IEA powered Pearson's WriteToLearn educational product and its large-scale automated scoring services for state and national exams.

Collaborative filtering

The term-document matrix is structurally identical to a user-item matrix in recommender systems, which made LSA an early prototype for matrix factorization approaches to collaborative filtering. Latent factor models that became standard for movie and product recommendation, including the SVD-based methods that won the Netflix Prize in 2009, descend directly from this idea.

Term embeddings before the word-embedding era

The rows of U_k are dense vector representations of terms with the property that semantically similar terms have similar vectors. They are, in effect, an early form of word embedding, produced by SVD on co-occurrence statistics rather than by neural training. This is more than a retrospective analogy: Levy and Goldberg's 2014 paper Neural Word Embedding as Implicit Matrix Factorization showed that word2vec's skip-gram with negative sampling is implicitly factorising a shifted pointwise mutual information matrix, putting it in the same mathematical family as LSA on a different cell weighting.[^levy2014]

Other applications

Cross-lingual retrieval, by training a single LSA space on a parallel corpus and using it to match queries in one language to documents in another.[^dumais2004]
Patent prior art search, electronic discovery (eDiscovery), and intelligence analysis, where matching by concept rather than by keyword reduces missed matches.[^dumais2004]
Biological text mining, including assigning MEDLINE abstracts to gene categories.
Source code search and software bug localisation, treating identifiers and comments as terms.
Memory psychology, where Howard and Kahana (1999) showed that LSA cosine similarity between words predicts the probability that they will be recalled together in free recall tasks.[^howard1999]

Comparison with later methods

LSA sits at the head of a long line of techniques for building low-dimensional representations of documents and words. The table below summarises the main relatives.

Method	Year	Type	Mathematical core	Notes
LSA / LSI	1988-1990	Linear, deterministic	Truncated SVD on TF-IDF term-document matrix	Original method, Bellcore patent and Deerwester et al. 1990[^deerwester1990]
pLSA	1999	Probabilistic mixture	Multinomial mixture model trained by EM on word-document counts	Hofmann's UAI/SIGIR 1999 papers; addresses LSA's lack of a generative semantics[^hofmann1999]
LDA	2003	Bayesian generative	Three-level hierarchical Bayesian topic model with Dirichlet priors	Blei, Ng and Jordan 2003 in JMLR; the standard topic model[^blei2003]
NMF	1999	Linear, deterministic	Non-negative factorisation A ~ WH with W,H >= 0	Lee and Seung 1999 in Nature; produces parts-based, interpretable factors[^leeseung1999]
word2vec	2013	Neural, predictive	Shallow neural network predicting context words (CBOW or skip-gram)	Mikolov et al. 2013; equivalent to factorising a shifted PMI matrix[^mikolov2013][^levy2014]
GloVe	2014	Linear, count-based	Weighted least-squares factorisation of log word-word co-occurrence matrix	Pennington et al. 2014; explicitly hybridises count and predict approaches[^pennington2014]
Sentence-BERT	2019	Contextual neural	Siamese BERT fine-tuned on sentence-pair tasks	Reimers and Gurevych 2019; one vector per sentence with full transformer context[^reimers2019]
Modern dense retrievers (DPR, ColBERT, E5, etc.)	2020-present	Contextual neural	Dual-encoder transformers trained on (query, relevant document) pairs	State of the art for retrieval, but require GPUs and large training corpora

The broad direction of travel is from one-shot linear factorisations of count statistics (LSA, NMF) toward probabilistic generative models (pLSA, LDA), and then to neural embeddings that condition on full context (Sentence-BERT, modern dense retrievers). LSA is the simplest and oldest member of this family, and several of the others reduce to LSA in special limits.

Strengths

LSA persists as a teaching staple and a practical baseline because it is genuinely good at certain things.

Theoretically clean. The algorithm is one matrix construction followed by one SVD, with one tunable hyperparameter. There is nothing hidden, no random initialisation, no learning rate, no convergence criteria.
Deterministic and reproducible. The same input matrix and the same k always produce the same output (up to sign flips of singular vectors), in contrast to neural methods where two runs differ.
Fast on commodity hardware. Truncated SVD of a TF-IDF matrix with hundreds of thousands of terms and documents finishes in minutes on a laptop using randomised SVD or ARPACK. No GPU required.
Strong baseline. On classical IR test collections LSA was at or above median TREC performance in the 1990s, and it remains a serious baseline against which neural methods are compared.[^manning2008]
Interpretable mathematics. Practitioners who already understand SVD, PCA and least squares can read off the entire derivation in an afternoon. The dimensions themselves are not always meaningful, but the operations producing them are.
Useful when training data is scarce. Because LSA is unsupervised and requires no labelled examples, it can be applied to any text corpus immediately. Studies have found that for small corpora (under 10 million words) LSA can outperform neural embeddings such as word2vec, which need much more data to converge.[^aclanthology_lsa_w2v]

Weaknesses

LSA is also clearly a 1990s technique with limitations that motivated everything that came after it.

Linear and bag-of-words. The model is a linear projection of a count matrix that ignores word order, syntax and negation. "Cat ate fish" and "fish ate cat" have identical LSA vectors.
One vector per word. No contextualisation: "bank" gets a single vector that averages the financial and geographical senses. This was the headline failure that contextual transformer encoders such as ELMo and BERT were designed to fix.
Latent dimensions lack clear meaning. Unlike NMF or LDA, the SVD's components can be negative and do not correspond to interpretable topics. A latent dimension might be best described as "0.4 x sport - 0.2 x finance + 0.1 x weather".
Hard to update incrementally. Adding new documents in principle requires recomputing the full SVD. Incremental and online SVD algorithms exist (Brand 2003, Gensim's online LSI) but they are approximate, and folding in many new documents distorts the basis over time.
One global k. A single dimensionality has to fit the entire corpus, even if some topics are richly represented and others are sparse.
Probabilistic mismatch. The SVD implicitly assumes Gaussian noise on the entries of the term-document matrix, but term counts are non-negative integers that are better modelled by Poisson or multinomial distributions. This mismatch was the explicit motivation for Hofmann's pLSA (1999) and Blei, Ng and Jordan's LDA (2003).[^hofmann1999][^blei2003]
No state-of-the-art performance. On modern retrieval benchmarks, dense neural retrievers and re-ranking with cross-encoders dominate LSA by large margins.

Computational practice

For anything beyond toy corpora, the cost of LSA is dominated by the SVD step. Three families of algorithms are used in practice.

Lanczos-based partial SVD. ARPACK and its successors compute the top k singular triples of a sparse matrix by Lanczos bidiagonalisation, requiring only matrix-vector products. This is the route used in MATLAB's svds and SciPy's scipy.sparse.linalg.svds. It was the standard at Bellcore in the 1990s; Manning, Raghavan and Schutze quote experiments where LSI on tens of thousands of documents took roughly a day on a single machine using this approach.[^manning2008]

Randomised SVD. Halko, Martinsson and Tropp's 2011 SIAM Review paper Finding Structure with Randomness introduced a modular framework that approximates the top k singular vectors using a small number of random projections. The method requires only 2(q+1) passes over the matrix and is markedly faster than Lanczos for large matrices with rapidly decaying singular spectra.[^halko2011] It is the default solver in scikit-learn's TruncatedSVD.

Incremental SVD. Brand's 2003 algorithm and its descendants update an existing SVD when rows or columns are added or removed, without recomputing from scratch. Gensim's LsiModel ships an online version that can stream very large corpora through bounded memory, suitable for collections that cannot fit in RAM.[^gensim_lsi]

Software

Library	Function or class	Notes
scikit-learn (Python)	`sklearn.decomposition.TruncatedSVD`	Default LSA implementation; supports ARPACK and randomised SVD solvers; recommended `n_components=100` for LSA[^sklearn_tsvd]
Gensim (Python)	`gensim.models.LsiModel`	Streaming, online, distributed-friendly; built for corpora that exceed RAM[^gensim_lsi]
MATLAB	`svds`, `svd`	`svds` is the standard partial-SVD routine for sparse matrices
R	`irlba` package	Implicitly Restarted Lanczos Bidiagonalisation; the standard partial-SVD package in R
SVDPACKC	C library	The original Bellcore-Tennessee software used in Berry, Dumais, O'Brien 1995[^berry1995]
Apache Spark	`RowMatrix.computeSVD`	Distributed truncated SVD for very large matrices

Notable applications and historical impact

In addition to its founding role in concept-based information retrieval, LSA found significant use in education through Pearson Knowledge Technologies' Intelligent Essay Assessor and the WriteToLearn product line, which used LSA-based scoring for state and national K-12 assessments.[^pearson_iea] The method was also adopted in library catalog systems, intelligence community text analysis (the SAIC Latent Semantic Indexing Engine was first deployed in 1999), patent prior art retrieval, and electronic discovery software for litigation support.[^dumais2004]

In machine learning research the influence of LSA shows up indirectly. Latent factor models for collaborative filtering, the Netflix Prize winners, and modern recommender systems all rest on essentially the same SVD-on-a-sparse-matrix idea. Levy and Goldberg's 2014 result that word2vec's skip-gram with negative sampling implicitly factorises a shifted PMI matrix made the kinship explicit: word2vec is, in a precise sense, LSA with a different cell weighting and a different optimisation procedure.[^levy2014] GloVe (Pennington et al. 2014) makes the connection more explicit still by directly factorising a weighted log-co-occurrence matrix.[^pennington2014]

Critique and current status

For state-of-the-art retrieval and topic modelling, LSA has been superseded by neural methods. Sentence-BERT and dense retrievers built on transformer encoders dominate retrieval benchmarks; LDA and its neural-variational descendants dominate topic modelling.[^reimers2019][^blei2003] Most contemporary NLP papers report LSA only as a baseline.

That does not make LSA obsolete. It is still the natural first thing to try when faced with a new text collection, particularly when computational budgets, training data or expertise rule out a transformer pipeline. It runs on a laptop, requires no labelled data, and gives interpretable, reproducible results. The mathematics it relies on (TF-IDF weighting, sparse matrices, the singular value decomposition, cosine similarity) are foundational pieces of applied linear algebra that turn up everywhere in machine learning, so understanding LSA is understanding several of the core building blocks of the field. As a teaching example it is unusually clean: the entire algorithm fits in a few lines of NumPy, the geometric intuition is visible, and the failure modes (polysemy, bag-of-words ignorance of order) directly motivate the historical sequence of methods that followed.

References

Latent semantic analysis (LSA)

Definition

Mathematical construction step by step

Step 1: Build the term-document matrix

Step 2: Compute the truncated SVD

Step 3: Embed terms and documents

Step 4: Fold in queries and new documents

Choosing k

Why it works: synonymy, polysemy and the curse of dimensionality

Use cases

Document retrieval (LSI)

Document similarity and clustering

Automated essay scoring

Collaborative filtering

Term embeddings before the word-embedding era

Other applications

Comparison with later methods

Strengths

Weaknesses

Computational practice

Software

Notable applications and historical impact

Critique and current status

See also

References

Improve this article

Related Articles

MTEB (Massive Text Embedding Benchmark)

Similarity Measure

LlamaIndex

Embeddings

Information Retrieval

Question answering

Latent semantic analysis (LSA)

Definition

Mathematical construction step by step

Step 1: Build the term-document matrix

Step 2: Compute the truncated SVD

Step 3: Embed terms and documents

Step 4: Fold in queries and new documents

Choosing k

Why it works: synonymy, polysemy and the curse of dimensionality

Use cases

Document retrieval (LSI)

Document similarity and clustering

Automated essay scoring

Collaborative filtering

Term embeddings before the word-embedding era

Other applications

Comparison with later methods

Strengths

Weaknesses

Computational practice

Software

Notable applications and historical impact

Critique and current status

See also

References

Related Articles

MTEB (Massive Text Embedding Benchmark)

Similarity Measure

LlamaIndex

Embeddings

Information Retrieval

Question answering