# Latent semantic analysis (LSA)

> Source: https://aiwiki.ai/wiki/latent_semantic_analysis
> Updated: 2026-06-24
> Categories: Information Retrieval, Mathematics, Natural Language Processing
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Latent semantic analysis (LSA)**, called **latent semantic indexing (LSI)** in [information retrieval](/wiki/information_retrieval) contexts, is an unsupervised technique that maps both terms and documents into a shared low-dimensional vector space by taking the truncated [singular value decomposition](/wiki/singular_value_decomposition) of a term-document matrix. Because the decomposition keeps only the dominant correlations in word usage, terms that mean the same thing (for example "car" and "automobile") land near each other in the reduced space even when they never appear in the same document, which lets retrieval match by concept rather than by exact keyword. LSA was introduced by a team at Bell Communications Research (Bellcore) in a 1988 patent and the 1990 paper "Indexing by Latent Semantic Analysis", and it is widely regarded as a direct precursor to modern word and document [embeddings](/wiki/embeddings).[^deerwester1990][^landauer1998][^uspto4839853]

The founding paper described the method plainly: it "takes advantage of implicit higher-order structure in the association of terms with documents ('semantic structure')" using "singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination."[^deerwester1990][^eric_ej415308] That 1990 paper is among the most cited works in information retrieval; the indexing service SciSpace records more than 13,000 citations for it.[^scispace_cites]

LSA was developed in the late 1980s by Scott Deerwester, Susan Dumais, George Furnas, Thomas Landauer, Richard Harshman, Karen Lochbaum and Lynn Streeter. The underlying method was filed as a US patent on 15 September 1988 (US patent 4,839,853, "Computer information retrieval using latent semantic structure", granted 13 June 1989), and the foundational journal paper appeared in the *Journal of the American Society for Information Science* in 1990, in volume 41, issue 6, pages 391-407.[^deerwester1990][^uspto4839853] More than three decades later, LSA remains a widely taught baseline in [natural language processing](/wiki/natural_language_processing) and information retrieval, a useful low-cost alternative when neural [embeddings](/wiki/embeddings) are out of reach, and a clean illustration of how linear algebra can carry semantic content.

## What is latent semantic analysis?

Let a corpus contain *m* unique terms across *n* documents. Build a term-document matrix **A** of shape *m* x *n* whose entry A_ij is some weighted count of how often term *i* appears in document *j*. The cells are typically TF-IDF weighted, although raw counts, log-frequency weighting and entropy-based weighting all appear in the literature.[^berry1995][^manning2008] LSA then computes the **truncated singular value decomposition** of **A**:

```
A  ~  U_k Σ_k V_k^T
```

where *k* is much smaller than the rank of **A** (typically a few hundred for collections with tens of thousands of documents; the original Bellcore work used about 100 factors).[^deerwester1990][^berry1995] The three factors have direct semantic interpretations:

- **U_k** is *m* x *k*. Its rows are *k*-dimensional term vectors in the latent semantic space.
- **Σ_k** is *k* x *k* and diagonal. Its entries are the *k* largest singular values of **A**, each measuring how much variance the corresponding latent dimension captures.
- **V_k^T** is *k* x *n*. Its columns are *k*-dimensional document vectors in the same space.

A query *q* is represented as a pseudo-document, that is, as a *m*-dimensional bag-of-words (or [bag of words](/wiki/bag_of_words)) vector with the same TF-IDF weighting as the original columns of **A**. It is projected into the latent space by

```
q_hat  =  Σ_k^{-1} U_k^T q
```

and compared to existing document vectors using [cosine similarity](/wiki/cosine_similarity).[^manning2008][^deerwester1990] The top-*k* nearest documents in this low-dimensional space are returned as the search result. Document-document similarity is computed as the cosine of the corresponding columns of Σ_k V_k^T, while term-term similarity uses the rows of U_k Σ_k.

The truncation step is what gives LSA its semantic character. The full rank reproduces **A** exactly and therefore behaves like a literal keyword match. Truncating to *k* dimensions discards the smallest singular values, which carry the most idiosyncratic and noisy variation in word usage, and keeps the dominant correlations between term occurrences. A document about "cars" and a document about "automobiles" can end up close in the latent space even if they share no surface vocabulary, because both correlate with the same set of supporting terms (engine, drive, road, fuel) and these correlations dominate the leading singular vectors.[^deerwester1990][^landauer1998]

The psychologist Thomas Landauer, one of the inventors, framed LSA as more than an engineering trick. In the widely read 1998 introduction, he and his coauthors defined it as "a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text."[^landauer1998]

## How is the LSA model constructed step by step?

### Step 1: Build the term-document matrix

The rows of **A** correspond to vocabulary terms, the columns to documents. The choice of cell weighting matters more than most beginners expect. Raw counts work poorly because frequent terms dominate. The standard recipe in the original Bellcore work and in Berry, Dumais and O'Brien (1995) is to combine a local term weight (typically log(1 + tf)) with a global term weight, often inverse document frequency or one minus normalised entropy across documents, and then optionally normalise each document column to unit length.[^berry1995][^deerwester1990] Modern implementations such as scikit-learn's TruncatedSVD pipeline almost always feed the SVD a [TF-IDF](/wiki/tf_idf) matrix produced by `TfidfVectorizer`.[^sklearn_tsvd]

### Step 2: Compute the truncated SVD

For any real *m* x *n* matrix the singular value decomposition factors **A** as **A** = U Σ V^T, where U is *m* x *r* with orthonormal columns, V is *n* x *r* with orthonormal columns, Σ is *r* x *r* diagonal with non-negative entries σ_1 >= σ_2 >= ... >= σ_r >= 0, and *r* is the rank of **A**.[^manning2008] Keeping the top *k* singular values and the corresponding columns of U and V gives the **best rank-*k* approximation of A in both the Frobenius norm and the spectral norm**, by the Eckart-Young-Mirsky theorem. This is the precise sense in which LSA produces an optimal low-rank summary of the term-document matrix.

### Step 3: Embed terms and documents

After truncation, term *i* is represented by row *i* of U_k and document *j* by column *j* of V_k^T. Multiplying by Σ_k to get U_k Σ_k or Σ_k V_k^T scales each latent dimension by its singular value, which is the standard form for taking cosines in the original Deerwester paper.[^deerwester1990] Geometrically, the original term-document matrix is replaced by its projection onto the *k*-dimensional subspace that captures the most variance, and all subsequent retrieval is done in that subspace.

### Step 4: Fold in queries and new documents

A query is a pseudo-document; it is encoded as an *m*-vector and projected into the latent space using q_hat = Σ_k^{-1} U_k^T q. New documents that arrive after the SVD has been computed can be "folded in" the same way, although strictly speaking this distorts the orthogonality properties of the basis. For corpora that change a lot, the SVD has to be recomputed periodically or maintained with an incremental SVD algorithm such as Brand's 2003 method.[^manning2008]

## How many dimensions (k) does LSA use?

The number of latent dimensions *k* is the only real hyperparameter in LSA. The original Bellcore paper worked with roughly 100 factors.[^deerwester1990] Bradford's 2008 study and Landauer's experiments suggest that around 300 dimensions usually work well for moderate-sized collections, with a useful range of roughly 100 to 500 for typical document sets and up to 1,000 for very large corpora.[^bradford2008][^landauer2008] Berry, Dumais and O'Brien quote 200 to 300 as the standard for retrieval applications.[^berry1995] Manning, Raghavan and Schutze report that LSI experiments at Bellcore in the early 1990s used Lanczos-based partial SVDs to compute decompositions of tens of thousands of documents in roughly a day on the hardware of the time, and that the resulting precision was at or above the median TREC participant for several test collections.[^manning2008]

Values of *k* that are too small lose discriminating power and collapse genuinely different documents onto the same vector. Values that are too large bring back the noise that the truncation was supposed to remove. There is no universally optimal choice, but most published experiments use values in the low hundreds and report only mild sensitivity within that range.

## Why does LSA work? Synonymy, polysemy and the curse of dimensionality

In raw bag-of-words [information retrieval](/wiki/information_retrieval), two documents that mean the same thing can have nearly disjoint vocabulary. A document that says "car repair" and a document that says "automobile maintenance" share no terms, so their cosine similarity in the original term space is zero. This is the **synonymy problem**: many words can express the same concept, and direct keyword matching cannot tell that they do.

LSA partially solves synonymy. Because both "car" and "automobile" tend to co-occur with the same supporting terms (engine, mechanic, oil), the SVD discovers a latent dimension that loads strongly on all of them. Documents that use either word end up close in the projected space. The same logic applies to spelling variants, near-synonyms across registers, and translations within bilingual corpora, and is the basis for the cross-lingual LSI experiments first reported by Dumais and colleagues in 1996.[^landauer1998][^dumais2004]

LSA only partially solves the **polysemy problem**, the converse of synonymy where the same word has multiple meanings. "Bank" can mean a financial institution or the side of a river, but LSA assigns a single vector to "bank" that is essentially an average of all its contextual uses. If both senses occur in the corpus the resulting vector is a compromise that may match neither sense well. This is a fundamental limitation of any one-vector-per-word model and was a major motivation for the contextual embeddings produced by [transformer](/wiki/transformer) encoders such as BERT, where the vector for "bank" depends on the surrounding sentence.[^landauer1998]

The move from an *m*-dimensional vocabulary to a *k*-dimensional latent space also addresses the **curse of dimensionality** that afflicts naive vector-space retrieval. In the raw term space, document vectors are sparse and high dimensional; cosines are dominated by which words happen to co-occur. After projection into a few hundred dense dimensions, every document has a non-trivial coordinate on every axis, and the geometry of similarity becomes much more informative.

## What is LSA used for?

### Document retrieval (LSI)

The original target application. A query is encoded as a pseudo-document, projected into the latent space, and ranked by cosine similarity against the document vectors. The Bellcore team reported substantial precision improvements over keyword matching on the MED, CISI and CRAN test collections, and Berry, Dumais and O'Brien (1995) document larger-scale experiments on collections of up to 1 million documents using the SVDPACKC software they wrote at the University of Tennessee.[^berry1995][^deerwester1990]

### Document similarity and clustering

Document vectors from LSA can be fed into standard clustering algorithms (k-means, hierarchical clustering) or used directly to compute pairwise similarities. Because the latent representation is dense and low dimensional, distance computations are much cheaper than in the raw term space.

### Automated essay scoring

Thomas Landauer's group commercialised LSA-based essay scoring through Knowledge Analysis Technologies, which Pearson Education acquired in 2004 to form Pearson Knowledge Technologies. Their **Intelligent Essay Assessor (IEA)** trains an LSA semantic space on a large reference corpus and then scores student essays by their cosine similarity to a small set of human-graded reference essays. Pearson reports that IEA needs roughly 100 pre-scored essays per prompt to train, compared with 300 to 500 essays per prompt for older systems, and that its scores agree with human raters about as well as human raters agree with each other.[^pearson_iea][^foltz1999] IEA powered Pearson's WriteToLearn educational product and its large-scale automated scoring services for state and national exams.

### Collaborative filtering

The term-document matrix is structurally identical to a user-item matrix in recommender systems, which made LSA an early prototype for [matrix factorization](/wiki/matrix_factorization) approaches to collaborative filtering. Latent factor models that became standard for movie and product recommendation, including the SVD-based methods that won the Netflix Prize in 2009, descend directly from this idea.

### Term embeddings before the word-embedding era

The rows of U_k are dense vector representations of terms with the property that semantically similar terms have similar vectors. They are, in effect, an early form of [word embedding](/wiki/word_embedding), produced by SVD on co-occurrence statistics rather than by neural training. This is more than a retrospective analogy: Levy and Goldberg's 2014 paper *Neural Word Embedding as Implicit Matrix Factorization* showed that word2vec's skip-gram with negative sampling is implicitly factorising a shifted pointwise mutual information matrix, putting it in the same mathematical family as LSA on a different cell weighting.[^levy2014]

### Other applications

- Cross-lingual retrieval, by training a single LSA space on a parallel corpus and using it to match queries in one language to documents in another.[^dumais2004]
- Patent prior art search, electronic discovery (eDiscovery), and intelligence analysis, where matching by concept rather than by keyword reduces missed matches.[^dumais2004]
- Biological text mining, including assigning MEDLINE abstracts to gene categories.
- Source code search and software bug localisation, treating identifiers and comments as terms.
- Memory psychology, where Howard and Kahana (1999) showed that LSA cosine similarity between words predicts the probability that they will be recalled together in free recall tasks.[^howard1999]

## How does LSA differ from pLSA, LDA and neural embeddings?

LSA sits at the head of a long line of techniques for building low-dimensional representations of documents and words. The two most-cited probabilistic successors, probabilistic LSA (pLSA) and latent Dirichlet allocation (LDA), were motivated explicitly by limitations of LSA's purely linear-algebraic formulation. pLSA replaces the SVD with a latent-class statistical model trained by expectation maximisation, while LDA adds a full Bayesian generative story with Dirichlet priors over topics. The table below summarises the main relatives.

| Method | Year | Type | Mathematical core | Notes |
|---|---|---|---|---|
| LSA / LSI | 1988-1990 | Linear, deterministic | Truncated SVD on TF-IDF term-document matrix | Original method, Bellcore patent and Deerwester et al. 1990[^deerwester1990] |
| pLSA | 1999 | Probabilistic mixture | Multinomial mixture model trained by EM on word-document counts | Hofmann's UAI/SIGIR 1999 papers; addresses LSA's lack of a generative semantics[^hofmann1999] |
| LDA | 2003 | Bayesian generative | Three-level hierarchical Bayesian topic model with Dirichlet priors | Blei, Ng and Jordan 2003 in JMLR; the standard topic model[^blei2003] |
| NMF | 1999 | Linear, deterministic | Non-negative factorisation A ~ WH with W,H >= 0 | Lee and Seung 1999 in *Nature*; produces parts-based, interpretable factors[^leeseung1999] |
| word2vec | 2013 | Neural, predictive | Shallow neural network predicting context words (CBOW or skip-gram) | Mikolov et al. 2013; equivalent to factorising a shifted PMI matrix[^mikolov2013][^levy2014] |
| GloVe | 2014 | Linear, count-based | Weighted least-squares factorisation of log word-word co-occurrence matrix | Pennington et al. 2014; explicitly hybridises count and predict approaches[^pennington2014] |
| Sentence-BERT | 2019 | Contextual neural | Siamese BERT fine-tuned on sentence-pair tasks | Reimers and Gurevych 2019; one vector per sentence with full transformer context[^reimers2019] |
| Modern dense retrievers (DPR, ColBERT, E5, etc.) | 2020-present | Contextual neural | Dual-encoder transformers trained on (query, relevant document) pairs | State of the art for retrieval, but require GPUs and large training corpora |

The central contrast is statistical versus algebraic. LSA is model-free: it produces a low-dimensional representation by linear algebra alone, with orthogonal latent axes and no probabilistic interpretation. Hofmann's pLSA (1999) instead treats topics as multinomial word distributions, allows them to be non-orthogonal, and fits them by maximum likelihood, which makes the resulting factors directly interpretable as probabilities. Blei, Ng and Jordan's LDA (2003) goes further, modelling the entire data-generating process with Dirichlet priors so that the topic mixtures generalise to unseen documents.[^hofmann1999][^blei2003] The broad direction of travel is from one-shot linear factorisations of count statistics (LSA, NMF) toward probabilistic generative models (pLSA, LDA), and then to neural embeddings that condition on full context (Sentence-BERT, modern dense retrievers). LSA is the simplest and oldest member of this family, and several of the others reduce to LSA in special limits.

## What are the strengths of LSA?

LSA persists as a teaching staple and a practical baseline because it is genuinely good at certain things.

- **Theoretically clean.** The algorithm is one matrix construction followed by one SVD, with one tunable hyperparameter. There is nothing hidden, no random initialisation, no learning rate, no convergence criteria.
- **Deterministic and reproducible.** The same input matrix and the same *k* always produce the same output (up to sign flips of singular vectors), in contrast to neural methods where two runs differ.
- **Fast on commodity hardware.** Truncated SVD of a TF-IDF matrix with hundreds of thousands of terms and documents finishes in minutes on a laptop using randomised SVD or ARPACK. No GPU required.
- **Strong baseline.** On classical IR test collections LSA was at or above median TREC performance in the 1990s, and it remains a serious baseline against which neural methods are compared.[^manning2008]
- **Interpretable mathematics.** Practitioners who already understand SVD, PCA and least squares can read off the entire derivation in an afternoon. The dimensions themselves are not always meaningful, but the operations producing them are.
- **Useful when training data is scarce.** Because LSA is unsupervised and requires no labelled examples, it can be applied to any text corpus immediately. Studies have found that for small corpora (under 10 million words) LSA can outperform neural embeddings such as word2vec, which need much more data to converge.[^aclanthology_lsa_w2v]

## What are the weaknesses of LSA?

LSA is also clearly a 1990s technique with limitations that motivated everything that came after it.

- **Linear and bag-of-words.** The model is a linear projection of a count matrix that ignores word order, syntax and negation. "Cat ate fish" and "fish ate cat" have identical LSA vectors.
- **One vector per word.** No contextualisation: "bank" gets a single vector that averages the financial and geographical senses. This was the headline failure that contextual transformer encoders such as ELMo and BERT were designed to fix.
- **Latent dimensions lack clear meaning.** Unlike NMF or LDA, the SVD's components can be negative and do not correspond to interpretable topics. A latent dimension might be best described as "0.4 x sport - 0.2 x finance + 0.1 x weather".
- **Hard to update incrementally.** Adding new documents in principle requires recomputing the full SVD. Incremental and online SVD algorithms exist (Brand 2003, Gensim's online LSI) but they are approximate, and folding in many new documents distorts the basis over time.
- **One global *k*.** A single dimensionality has to fit the entire corpus, even if some topics are richly represented and others are sparse.
- **Probabilistic mismatch.** The SVD implicitly assumes Gaussian noise on the entries of the term-document matrix, but term counts are non-negative integers that are better modelled by Poisson or multinomial distributions. This mismatch was the explicit motivation for Hofmann's pLSA (1999) and Blei, Ng and Jordan's LDA (2003).[^hofmann1999][^blei2003]
- **No state-of-the-art performance.** On modern retrieval benchmarks, dense neural retrievers and re-ranking with cross-encoders dominate LSA by large margins.

## Computational practice

For anything beyond toy corpora, the cost of LSA is dominated by the SVD step. Three families of algorithms are used in practice.

**Lanczos-based partial SVD.** ARPACK and its successors compute the top *k* singular triples of a sparse matrix by Lanczos bidiagonalisation, requiring only matrix-vector products. This is the route used in MATLAB's `svds` and SciPy's `scipy.sparse.linalg.svds`. It was the standard at Bellcore in the 1990s; Manning, Raghavan and Schutze quote experiments where LSI on tens of thousands of documents took roughly a day on a single machine using this approach.[^manning2008]

**Randomised SVD.** Halko, Martinsson and Tropp's 2011 SIAM Review paper *Finding Structure with Randomness* introduced a modular framework that approximates the top *k* singular vectors using a small number of random projections. The method requires only 2(q+1) passes over the matrix and is markedly faster than Lanczos for large matrices with rapidly decaying singular spectra.[^halko2011] It is the default solver in scikit-learn's `TruncatedSVD`.

**Incremental SVD.** Brand's 2003 algorithm and its descendants update an existing SVD when rows or columns are added or removed, without recomputing from scratch. Gensim's `LsiModel` ships an online version that can stream very large corpora through bounded memory, suitable for collections that cannot fit in RAM.[^gensim_lsi]

## Software

| Library | Function or class | Notes |
|---|---|---|
| scikit-learn (Python) | `sklearn.decomposition.TruncatedSVD` | Default LSA implementation; supports ARPACK and randomised SVD solvers; recommended `n_components=100` for LSA[^sklearn_tsvd] |
| Gensim (Python) | `gensim.models.LsiModel` | Streaming, online, distributed-friendly; built for corpora that exceed RAM[^gensim_lsi] |
| MATLAB | `svds`, `svd` | `svds` is the standard partial-SVD routine for sparse matrices |
| R | `irlba` package | Implicitly Restarted Lanczos Bidiagonalisation; the standard partial-SVD package in R |
| SVDPACKC | C library | The original Bellcore-Tennessee software used in Berry, Dumais, O'Brien 1995[^berry1995] |
| Apache Spark | `RowMatrix.computeSVD` | Distributed truncated SVD for very large matrices |

## Notable applications and historical impact

In addition to its founding role in concept-based information retrieval, LSA found significant use in education through Pearson Knowledge Technologies' Intelligent Essay Assessor and the WriteToLearn product line, which used LSA-based scoring for state and national K-12 assessments.[^pearson_iea] The method was also adopted in library catalog systems, intelligence community text analysis (the SAIC Latent Semantic Indexing Engine was first deployed in 1999), patent prior art retrieval, and electronic discovery software for litigation support.[^dumais2004]

In machine learning research the influence of LSA shows up indirectly. Latent factor models for collaborative filtering, the Netflix Prize winners, and modern recommender systems all rest on essentially the same SVD-on-a-sparse-matrix idea. Levy and Goldberg's 2014 result that word2vec's skip-gram with negative sampling implicitly factorises a shifted PMI matrix made the kinship explicit: word2vec is, in a precise sense, LSA with a different cell weighting and a different optimisation procedure.[^levy2014] GloVe (Pennington et al. 2014) makes the connection more explicit still by directly factorising a weighted log-co-occurrence matrix.[^pennington2014]

## Critique and current status

For state-of-the-art retrieval and topic modelling, LSA has been superseded by neural methods. Sentence-BERT and dense retrievers built on transformer encoders dominate retrieval benchmarks; LDA and its neural-variational descendants dominate topic modelling.[^reimers2019][^blei2003] Most contemporary NLP papers report LSA only as a baseline.

That does not make LSA obsolete. It is still the natural first thing to try when faced with a new text collection, particularly when computational budgets, training data or expertise rule out a transformer pipeline. It runs on a laptop, requires no labelled data, and gives interpretable, reproducible results. The mathematics it relies on (TF-IDF weighting, sparse matrices, the singular value decomposition, cosine similarity) are foundational pieces of applied linear algebra that turn up everywhere in machine learning, so understanding LSA is understanding several of the core building blocks of the field. As a teaching example it is unusually clean: the entire algorithm fits in a few lines of NumPy, the geometric intuition is visible, and the failure modes (polysemy, bag-of-words ignorance of order) directly motivate the historical sequence of methods that followed.

## See also

[Bag of words](/wiki/bag_of_words), [cosine similarity](/wiki/cosine_similarity), [embeddings](/wiki/embeddings), [information retrieval](/wiki/information_retrieval), [matrix factorization](/wiki/matrix_factorization), [natural language processing](/wiki/natural_language_processing), [singular value decomposition](/wiki/singular_value_decomposition), [TF-IDF](/wiki/tf_idf), [word embedding](/wiki/word_embedding).

## References

[^deerwester1990]: Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by Latent Semantic Analysis. *Journal of the American Society for Information Science*, 41(6), 391-407. [DOI link](https://asistdl.onlinelibrary.wiley.com/doi/10.1002/(SICI)1097-4571(199009)41:6%3C391::AID-ASI1%3E3.0.CO;2-9). PDF: [wordvec.colorado.edu/papers/Deerwester_1990.pdf](http://wordvec.colorado.edu/papers/Deerwester_1990.pdf).
[^eric_ej415308]: ERIC record EJ415308, abstract of Deerwester et al. (1990), "Indexing by Latent Semantic Analysis", *Journal of the American Society for Information Science*, v41 n6 p391-407. [eric.ed.gov/?id=EJ415308](https://eric.ed.gov/?id=EJ415308).
[^scispace_cites]: SciSpace (formerly Typeset) citation record for Deerwester et al. (1990), "Indexing by Latent Semantic Analysis" (13,000+ citations). [scispace.com/papers/indexing-by-latent-semantic-analysis-4kzjkh4lw5](https://scispace.com/papers/indexing-by-latent-semantic-analysis-4kzjkh4lw5).
[^uspto4839853]: Deerwester, S., Dumais, S. T., Furnas, G. W., Harshman, R., Landauer, T. K., Lochbaum, K., and Streeter, L. *Computer information retrieval using latent semantic structure*. US Patent 4,839,853, filed 15 September 1988, granted 13 June 1989. [patents.google.com/patent/US4839853A](https://patents.google.com/patent/US4839853A/en).
[^landauer1998]: Landauer, T. K., Foltz, P. W., and Laham, D. (1998). An Introduction to Latent Semantic Analysis. *Discourse Processes*, 25(2-3), 259-284. [tandfonline.com/doi/abs/10.1080/01638539809545028](https://www.tandfonline.com/doi/abs/10.1080/01638539809545028). PDF: [wordvec.colorado.edu/papers/Landauer_Foltz_Laham_1998.pdf](http://wordvec.colorado.edu/papers/Landauer_Foltz_Laham_1998.pdf).
[^berry1995]: Berry, M. W., Dumais, S. T., and O'Brien, G. W. (1995). Using Linear Algebra for Intelligent Information Retrieval. *SIAM Review*, 37(4), 573-595. [epubs.siam.org/doi/10.1137/1037127](https://epubs.siam.org/doi/10.1137/1037127).
[^manning2008]: Manning, C. D., Raghavan, P., and Schutze, H. (2008). *Introduction to Information Retrieval*, Chapter 18: "Matrix decompositions and latent semantic indexing". Cambridge University Press. [nlp.stanford.edu/IR-book/pdf/18lsi.pdf](https://nlp.stanford.edu/IR-book/pdf/18lsi.pdf).
[^hofmann1999]: Hofmann, T. (1999). Probabilistic Latent Semantic Analysis. In *Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI)*, 289-296. Also: Hofmann, T. (1999). Probabilistic Latent Semantic Indexing. In *Proceedings of the 22nd Annual International ACM SIGIR Conference*, 50-57. [sigir.org/wp-content/uploads/2017/06/p211.pdf](https://sigir.org/wp-content/uploads/2017/06/p211.pdf).
[^blei2003]: Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet Allocation. *Journal of Machine Learning Research*, 3, 993-1022. [jmlr.csail.mit.edu/papers/v3/blei03a.html](https://jmlr.csail.mit.edu/papers/v3/blei03a.html).
[^leeseung1999]: Lee, D. D., and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. *Nature*, 401, 788-791. [nature.com/articles/44565](https://www.nature.com/articles/44565).
[^mikolov2013]: Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. *arXiv:1301.3781*. [arxiv.org/abs/1301.3781](https://arxiv.org/abs/1301.3781).
[^pennington2014]: Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. In *EMNLP 2014*, 1532-1543. [nlp.stanford.edu/pubs/glove.pdf](https://nlp.stanford.edu/pubs/glove.pdf).
[^reimers2019]: Reimers, N., and Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In *EMNLP-IJCNLP 2019*, 3982-3992. [arxiv.org/abs/1908.10084](https://arxiv.org/abs/1908.10084).
[^levy2014]: Levy, O., and Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization. In *NeurIPS 2014*. [papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization](https://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization).
[^halko2011]: Halko, N., Martinsson, P. G., and Tropp, J. A. (2011). Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. *SIAM Review*, 53(2), 217-288. [epubs.siam.org/doi/10.1137/090771806](https://epubs.siam.org/doi/10.1137/090771806). [arxiv.org/abs/0909.4061](https://arxiv.org/abs/0909.4061).
[^dumais2004]: Dumais, S. T. (2004). Latent semantic analysis. *Annual Review of Information Science and Technology*, 38(1), 188-230. [asistdl.onlinelibrary.wiley.com/doi/abs/10.1002/aris.1440380105](https://asistdl.onlinelibrary.wiley.com/doi/abs/10.1002/aris.1440380105).
[^foltz1999]: Foltz, P. W., Laham, D., and Landauer, T. K. (1999). The Intelligent Essay Assessor: Applications to Educational Technology. *Interactive Multimedia Electronic Journal of Computer-Enhanced Learning*, 1(2). [imej.wfu.edu/articles/1999/2/04/index.asp](http://imej.wfu.edu/articles/1999/2/04/index.asp).
[^pearson_iea]: Pearson Knowledge Technologies. *WriteToLearn / Intelligent Essay Assessor* product overview. [pearsonassessments.com/large-scale-assessments/k-12-large-scale-assessments/automated-scoring.html](https://www.pearsonassessments.com/large-scale-assessments/k-12-large-scale-assessments/automated-scoring.html).
[^sklearn_tsvd]: scikit-learn developers. *sklearn.decomposition.TruncatedSVD* documentation. [scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html).
[^gensim_lsi]: Rehurek, R. *Gensim: models.lsimodel - Latent Semantic Indexing* documentation. [radimrehurek.com/gensim/models/lsimodel.html](https://radimrehurek.com/gensim/models/lsimodel.html).
[^bradford2008]: Bradford, R. B. (2008). An Empirical Study of Required Dimensionality for Large-Scale Latent Semantic Indexing Applications. In *CIKM 2008*, 153-162.
[^landauer2008]: Landauer, T. K., McNamara, D. S., Dennis, S., and Kintsch, W. (eds.) (2007). *Handbook of Latent Semantic Analysis*. Lawrence Erlbaum Associates.
[^howard1999]: Howard, M. W., and Kahana, M. J. (1999). Contextual Variability and Serial Position Effects in Free Recall. *Journal of Experimental Psychology: Learning, Memory, and Cognition*, 25(4), 923-941.
[^aclanthology_lsa_w2v]: Altszyler, E., Sigman, M., Ribeiro, S., and Slezak, D. F. (2017). Comparative study of LSA vs Word2vec embeddings in small corpora: a case study in dreams database. *arXiv:1610.01520*. [arxiv.org/abs/1610.01520](https://arxiv.org/abs/1610.01520).

