# Latent Dirichlet allocation

> Source: https://aiwiki.ai/wiki/latent_dirichlet_allocation
> Updated: 2026-07-11
> Categories: Artificial Intelligence, Machine Learning, Natural Language Processing, Statistics
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Latent Dirichlet allocation** (**LDA**) is a generative probabilistic model that discovers the hidden thematic structure in a collection of documents by treating each document as a mixture of a small number of latent topics, and each topic as a probability distribution over the vocabulary. Given an unlabeled corpus, LDA infers both the topic mixture for every document and the word distribution for every topic, making it the foundational algorithm for [topic modeling](/wiki/topic_model) in [natural language processing](/wiki/natural_language_processing). It was introduced by [David Blei](/wiki/david_blei), [Andrew Ng](/wiki/andrew_ng), and [Michael Jordan](/wiki/michael_jordan) in a paper presented at NIPS 2002 and published in the *Journal of Machine Learning Research* (JMLR), volume 3, pages 993 to 1022, in January 2003 [1]. The authors define it plainly in the paper's opening line: "We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora" [1].

The 2003 JMLR paper is one of the most cited works in all of [machine learning](/wiki/machine_learning), with well over 50,000 citations on Google Scholar, and for roughly a decade LDA was the default tool for topic modeling on text corpora. Formally it is, in the authors' words, "a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics" [1]. The method is not specific to language. Anything that can be represented as a bag of discrete tokens, including images quantized into visual words, audio segments, log lines, or genetic sequences, can be modeled with LDA.

LDA is best understood as a Bayesian extension of probabilistic Latent Semantic Indexing ([pLSI](/wiki/probabilistic_latent_semantic_indexing)) due to Thomas Hofmann (1999) [2], which itself was a probabilistic reading of [Latent Semantic Analysis](/wiki/latent_semantic_analysis) by Deerwester and colleagues (1990) [3]. Where pLSI directly fits per-document topic proportions and offers no clean way to handle a new document, LDA places a Dirichlet prior over those proportions, turning the per-document mixture into a latent random variable. That single change makes the model fully generative, gives the parameter count a fixed ceiling that does not grow with the corpus, and lets the model assign topic distributions to documents it has never seen.

In the deep-learning era, dense neural representations have replaced LDA for many practical tasks. [Word embeddings](/wiki/word_embeddings) and contextual encoders from [transformers](/wiki/transformer) capture semantics that bag-of-words models miss, and clustering pipelines like [BERTopic](/wiki/bertopic) and [Top2Vec](/wiki/top2vec) often produce more coherent topics on contemporary corpora. LDA is still in active use, though, especially in social science, legal, and humanities work where reproducibility, simple interpretability, and the per-document topic vector are exactly what the analyst wants. It also remains a standard teaching example for [variational inference](/wiki/variational_inference), [Gibbs sampling](/wiki/gibbs_sampling), and Bayesian [graphical models](/wiki/graphical_model).

## At a glance

| Field | Value |
|---|---|
| Type | Generative probabilistic Bayesian model |
| Domain | Topic modeling on bag-of-words data |
| Authors | David Blei, Andrew Ng, Michael I. Jordan |
| Introduced | NIPS 2002; published JMLR vol. 3, pp. 993 to 1022, January 2003 |
| Citations (Google Scholar, 2024) | More than 50,000 |
| Inference engines | Variational Bayes, collapsed Gibbs sampling, online VB, stochastic variational inference |
| Predecessor | Probabilistic LSI (Hofmann, 1999) |
| Independent precursor | Pritchard, Stephens, Donnelly model in genetics (2000) |
| Reference implementation | David Blei's lda-c |
| Standard libraries | Mallet, Gensim, scikit-learn, Stan, PyMC, Spark MLlib |
| Common modern replacements | BERTopic, Top2Vec, Embedded Topic Model |


## ELI5

Imagine the news. A typical article isn't only about one thing. A piece about a new climate bill is partly about politics, partly about energy, and partly about economics. LDA tries to discover the small set of underlying themes (call them topics) running through a whole pile of articles, without anyone telling it what those themes are.

It does this by pretending each article was written by a strange but methodical author. For every word, the author first rolls a weighted die to pick a topic, then rolls a different die (one belonging to that topic) to pick a word. Topics that come up more often in the article reflect what the article is mostly about. The model then runs that story backwards: given the words you actually see, what set of topics, and what mix per article, would best explain them? After enough passes, the words cluster into coherent groups (one bucket of words about politics, another about basketball, another about cooking) and each article gets a fingerprint showing how much of each topic it contains.

No grammar, no meaning, no sentence structure. Just word counts and probability. It works much better than it has any right to.

## Where did LDA come from?

The lineage that LDA caps off goes back to [information retrieval](/wiki/information_retrieval) in the late 1980s. [Latent Semantic Analysis](/wiki/latent_semantic_analysis) (also called LSI in retrieval contexts), proposed by Deerwester, Dumais, Furnas, Landauer, and Harshman in 1990, builds a term-document matrix and applies truncated singular value decomposition to project documents into a lower-dimensional semantic space [3]. LSA captures co-occurrence structure and gives a useful similarity measure between documents, but it is a linear-algebraic technique, not a probabilistic model. It has no notion of a topic as a probability distribution, and the projected coordinates can be negative, which makes them awkward to interpret.

Thomas Hofmann's [pLSI](/wiki/probabilistic_latent_semantic_indexing) (1999) reframed LSA as a probabilistic mixture model [2]. In pLSI, every word in a document is assumed to be drawn from one of K topic distributions, and each document has its own mixing weights over those topics. Maximum likelihood estimation via [expectation-maximization](/wiki/expectation_maximization) then learns both the topics and the per-document weights. pLSI gave the field a clean probabilistic semantics for topic modeling, but Blei, Ng, and Jordan pointed out two structural weaknesses [1]:

- The per-document weights are model parameters, one set per document, so the number of parameters grows linearly with the corpus. The model cannot generalize cleanly to a new document.
- pLSI is not a proper generative model at the document level. It can generate words inside an existing document, but it has no mechanism for sampling a new document from scratch.

LDA fixes both problems by adding a [Dirichlet prior](/wiki/dirichlet_distribution) over the per-document topic proportions and (in the smoothed version most people use) over the topic-word distributions as well. The Dirichlet is the conjugate prior of the categorical and multinomial distributions, which keeps the math tractable. The result is a fully generative, fully Bayesian alternative to pLSI with a fixed number of hyperparameters.

For full historical accuracy, the population-genetics community arrived at a structurally similar model three years before LDA. Pritchard, Stephens, and Donnelly (2000), in their paper introducing the STRUCTURE program, modeled each individual's genome as a mixture of K ancestral populations, with each population modeled as a vector of allele frequencies [31]. The fitted mixture proportions describe how much of an individual's genome comes from each ancestral population. Translated into the LDA vocabulary, individuals correspond to documents, ancestral populations to topics, alleles to words, and the inference target is identical: a per-document mixture over latent groups together with per-group distributions over the observation vocabulary. Blei et al. cited the Pritchard work in later expositions; the two lineages converged on the same Bayesian admixture model from different application domains.

## How does LDA work?

The core idea is a simple two-stage generative story for each word: first pick a topic from the document's topic mixture, then pick a word from that topic's word distribution. Let $$K$$ be the number of topics, $$V$$ the vocabulary size, $$D$$ the number of documents, and $$N_d$$ the length of document d. The smoothed LDA model has two Dirichlet hyperparameters, $$\alpha$$ (a $$K$$-vector) governing document-topic proportions, and $$\beta$$ or $$\eta$$ (a $$V$$-vector) governing topic-word distributions. The generative story, as written in the original paper, is [1]:

```
for each topic k = 1..K:
    phi_k ~ Dirichlet(beta)              # word distribution for topic k

for each document d = 1..D:
    theta_d ~ Dirichlet(alpha)           # topic distribution for document d
    for each word position n = 1..N_d:
        z_{d,n} ~ Categorical(theta_d)   # pick a topic
        w_{d,n} ~ Categorical(phi_{z_{d,n}})  # pick a word from that topic
```

In plate notation, three plates capture the structure: an outer plate of size K for topics (containing phi), an outer plate of size D for documents (containing theta), and an inner plate of size N_d for word positions inside each document (containing the latent topic z and the observed word w). Only the words $$w_{d,n}$$ are observed; $$\theta$$, $$\phi$$, and the topic assignments $$z$$ are all latent. This is exactly the three-level hierarchy (corpus-level topics, document-level mixtures, word-level assignments) that Blei et al. describe as "a three-level hierarchical Bayesian model" [1].
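The generative story above can be sketched directly in code. This is a minimal illustration using only the Python standard library, assuming symmetric scalar priors; the names `sample_dirichlet` and `generate_corpus` are hypothetical helpers chosen here, not part of any library or of the original paper.

```python
import random


def sample_dirichlet(concentration, rng):
    """Draw one point on the simplex by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [x / total for x in draws]


def generate_corpus(K, V, doc_lengths, alpha, beta, seed=0):
    """Sample topics, document mixtures, topic assignments, and words from smoothed LDA."""
    rng = random.Random(seed)
    # phi_k ~ Dirichlet(beta): one word distribution per topic
    phi = [sample_dirichlet([beta] * V, rng) for _ in range(K)]
    theta, z, w = [], [], []
    for n_d in doc_lengths:
        # theta_d ~ Dirichlet(alpha): topic mixture for this document
        theta_d = sample_dirichlet([alpha] * K, rng)
        # z_{d,n} ~ Categorical(theta_d): one topic per word position
        z_d = rng.choices(range(K), weights=theta_d, k=n_d)
        # w_{d,n} ~ Categorical(phi_{z_{d,n}}): one word per chosen topic
        w_d = [rng.choices(range(V), weights=phi[k])[0] for k in z_d]
        theta.append(theta_d)
        z.append(z_d)
        w.append(w_d)
    return phi, theta, z, w


phi, theta, z, w = generate_corpus(K=3, V=8, doc_lengths=[12, 20], alpha=0.5, beta=0.1)
print(w[0])  # word ids of the first synthetic document
```

Only `w` would be visible to an inference algorithm; `phi`, `theta`, and `z` are the latent quantities it has to recover.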

A few features of this construction are worth pulling out. The model treats words as exchangeable inside a document (a bag-of-words assumption), and treats documents as exchangeable inside the corpus. There is no syntax, no word order, no notion of sentence boundary. By de Finetti's theorem, the exchangeability assumption corresponds exactly to the latent-mixture form Blei et al. write down. The Dirichlet prior on $$\theta$$ has a clean interpretation: small $$\alpha$$ values (less than 1) push each document toward using only a few topics; large $$\alpha$$ values spread mass across many topics. The same logic applies to $$\beta$$ on the word side. In practice analysts often use symmetric Dirichlet priors with small alpha, since real documents tend to be about a small handful of things at a time.
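The effect of $$\alpha$$ on sparsity is easy to check by simulation. The sketch below draws from a symmetric Dirichlet and reports the average weight of the single largest topic; the helper names and the choices of `K=10` and 2,000 samples are arbitrary assumptions for illustration.

```python
import random


def sample_dirichlet(concentration, rng):
    """Draw one point on the simplex by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [x / total for x in draws]


def mean_largest_share(alpha, K=10, n_samples=2000, seed=0):
    """Average weight of the largest component under a symmetric Dirichlet(alpha)."""
    rng = random.Random(seed)
    total = sum(max(sample_dirichlet([alpha] * K, rng)) for _ in range(n_samples))
    return total / n_samples


sparse = mean_largest_share(alpha=0.1)    # small alpha: one topic tends to dominate
diffuse = mean_largest_share(alpha=10.0)  # large alpha: mass spread across topics
print(f"alpha=0.1  -> mean largest topic share {sparse:.2f}")
print(f"alpha=10.0 -> mean largest topic share {diffuse:.2f}")
```

With a small $$\alpha$$ the largest topic should take most of the mass in a typical draw, while with a large $$\alpha$$ every draw stays close to the uniform mixture.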

The joint distribution induced by the model factorizes as

$$
p(\phi, \theta, z, w \mid \alpha, \beta) = \prod_k p(\phi_k \mid \beta) \prod_d p(\theta_d \mid \alpha) \prod_d \prod_n p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \phi_{z_{d,n}})
$$

The quantity practitioners want is the posterior $$p(\phi, \theta, z \mid w, \alpha, \beta)$$. Computing it exactly requires marginalizing over the discrete topic assignments and the continuous theta and phi simultaneously. That marginal sum has $$K^N$$ terms (where $$N$$ is the total number of word tokens in the corpus) and is intractable for any realistic dataset [1].
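To see where the blow-up comes from, it can help to write out the marginal likelihood of a single document with the topics $$\phi$$ held fixed (a sketch in this article's notation, not a new result):

$$
p(w_d \mid \alpha, \phi) = \int p(\theta_d \mid \alpha) \prod_{n=1}^{N_d} \sum_{k=1}^{K} \theta_{d,k}\, \phi_{k, w_{d,n}} \, d\theta_d
$$

Expanding the product of sums yields $$K^{N_d}$$ terms, one for each joint assignment of topics to the words of the document. With $$K = 10$$ and a 100-word document that is already $$10^{100}$$ terms. When $$\phi$$ is also integrated out, as in the smoothed model, the documents are no longer independent and the count becomes $$K^N$$ over the whole corpus.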

## How is LDA trained? Inference algorithms

Because the posterior is intractable, every practical use of LDA depends on an approximate inference algorithm. The two workhorses are mean-field variational Bayes (introduced in the original paper) and collapsed Gibbs sampling (popularized by Griffiths and Steyvers in 2004). Four families have dominated overall: those two, plus online variational Bayes and stochastic variational inference, which scale the variational approach to very large corpora.

### Variational Bayes

The original Blei, Ng, and Jordan paper introduced a mean-field [variational inference](/wiki/variational_inference) algorithm [1]. The method posits a fully factorized variational distribution

$$
q(\theta, z \mid \gamma, \phi) = \prod_d \left[ q(\theta_d \mid \gamma_d) \prod_{n=1}^{N_d} q(z_{d,n} \mid \phi_{d,n}) \right]
$$

in which $$\theta_d$$ is approximated by a Dirichlet with free parameter $$\gamma_d$$, and each topic assignment $$z_{d,n}$$ is approximated by an independent categorical with free parameter $$\phi_{d,n}$$. (The notation collides with the topic-word $$\phi$$ above; in the paper the per-word variational categorical is also called $$\phi$$.) The algorithm then maximizes the evidence lower bound (ELBO) by coordinate ascent, yielding fixed-point updates that look like soft EM steps. Each pass over a document updates $$\phi_{d,n}$$ for each word and $$\gamma_d$$ for the document; corpus-level updates handle the topic-word distributions in the smoothed version of the model.

Variational EM gives a deterministic, reasonably fast algorithm that scales linearly in the corpus size and is easy to parallelize over documents. The price is a bias: the mean-field approximation underestimates posterior variance and tends to find peakier topic distributions than fully Bayesian methods would. For most practical purposes the bias is acceptable.
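The per-document coordinate ascent can be sketched in a few lines. The example below assumes the topic-word distributions are held fixed (passed in as `topics`) and alternates the two fixed-point updates from the original paper: each word's variational categorical is proportional to the topic's probability for that word times $$\exp(\Psi(\gamma_{d,k}))$$, and $$\gamma_d$$ is $$\alpha$$ plus the summed responsibilities. The `digamma` helper is a hand-rolled approximation included only to keep the sketch dependency-free, and `fit_document` is a hypothetical name.

```python
import math


def digamma(x):
    """Approximate digamma for x > 0: recurrence up to x >= 6, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))


def fit_document(word_ids, topics, alpha, n_iter=50):
    """Mean-field updates for one document with fixed topic-word rows `topics`."""
    K, N = len(topics), len(word_ids)
    gamma = [a + N / K for a in alpha]           # initialization used in the paper
    phi_var = [[1.0 / K] * K for _ in word_ids]  # variational categoricals, one per word
    for _ in range(n_iter):
        # exp(E[log theta_k]) up to a factor shared by all k, which cancels below
        expected = [math.exp(digamma(g)) for g in gamma]
        for n, w in enumerate(word_ids):
            weights = [topics[k][w] * expected[k] for k in range(K)]
            total = sum(weights)
            phi_var[n] = [x / total for x in weights]
        gamma = [alpha[k] + sum(row[k] for row in phi_var) for k in range(K)]
    return gamma, phi_var


topics = [[0.45, 0.45, 0.05, 0.05],
          [0.05, 0.05, 0.45, 0.45]]
gamma, phi_var = fit_document([0, 1, 0, 1, 2], topics, alpha=[0.1, 0.1])
print([round(g, 2) for g in gamma])
```

Because every row of `phi_var` sums to one, the entries of `gamma` always sum to the prior total plus the document length, which is a convenient sanity check when implementing the updates.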

### Collapsed Gibbs sampling

Griffiths and Steyvers, working independently, introduced a [Gibbs sampling](/wiki/gibbs_sampling) algorithm for LDA that became enormously popular after their 2004 PNAS paper "Finding Scientific Topics" [4]. Their key trick was to integrate out $$\theta$$ and $$\phi$$ analytically, exploiting Dirichlet-multinomial conjugacy, so that the only variables left to sample are the discrete topic assignments $$z_{d,n}$$. This is called collapsed Gibbs sampling.

The sampler iterates over every word token in the corpus and resamples its topic assignment from the conditional

$$
p(z_{d,n} = k \mid z_{-(d,n)}, w) \propto (n^{-(d,n)}_{d,k} + \alpha_k) \cdot \frac{n^{-(d,n)}_{k,w_{d,n}} + \beta}{n^{-(d,n)}_{k,\cdot} + V\beta}
$$

where $$n_{d,k}$$ is the count of words in document d assigned to topic k, $$n_{k,v}$$ is the count of times word v has been assigned to topic k corpus-wide, and the superscript indicates the count is taken excluding the current token. After a burn-in period the samples can be averaged to estimate theta and phi. Each pass over the corpus is computationally similar to a pass of variational EM, but the algorithm is exceptionally simple to implement (the inner loop is essentially counts and a normalized vector) and it converges to the true posterior in the limit, which variational methods do not.
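A compact sketch of the sampler shows how little machinery the inner loop needs. It assumes symmetric scalar priors, represents each document as a list of integer word ids, and returns point estimates from the final state of the chain rather than averaging over samples; `collapsed_gibbs` and its defaults are illustrative choices, not a reference implementation.

```python
import random


def collapsed_gibbs(docs, K, V, alpha=0.1, beta=0.01, n_sweeps=100, seed=0):
    """Collapsed Gibbs sampler for LDA with symmetric scalar alpha and beta."""
    rng = random.Random(seed)
    n_dk = [[0] * K for _ in docs]      # tokens in document d assigned to topic k
    n_kv = [[0] * V for _ in range(K)]  # times word v is assigned to topic k
    n_k = [0] * K                       # total tokens assigned to topic k

    # Random initial assignments, with the count tables kept in sync.
    z = [[rng.randrange(K) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for w, k in zip(doc, z[d]):
            n_dk[d][k] += 1
            n_kv[k][w] += 1
            n_k[k] += 1

    for _ in range(n_sweeps):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Remove the current token from the counts.
                k = z[d][n]
                n_dk[d][k] -= 1
                n_kv[k][w] -= 1
                n_k[k] -= 1
                # Unnormalized conditional for each candidate topic j.
                weights = [
                    (n_dk[d][j] + alpha) * (n_kv[j][w] + beta) / (n_k[j] + V * beta)
                    for j in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                # Add the token back under its newly sampled topic.
                z[d][n] = k
                n_dk[d][k] += 1
                n_kv[k][w] += 1
                n_k[k] += 1

    # Smoothed point estimates from the final state of the chain.
    theta = [[(c + alpha) / (len(doc) + K * alpha) for c in n_dk[d]]
             for d, doc in enumerate(docs)]
    phi = [[(c + beta) / (n_k[k] + V * beta) for c in n_kv[k]] for k in range(K)]
    return theta, phi, z


docs = [[0, 1, 2, 0, 1], [0, 2, 1, 1], [3, 4, 5, 3], [4, 5, 5, 3, 4]]
theta, phi, z = collapsed_gibbs(docs, K=2, V=6)
print([[round(p, 2) for p in row] for row in theta])
```

A production sampler would discard burn-in sweeps and average several later samples, as described above; using only the last state keeps the sketch short.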

Collapsed Gibbs sampling is the algorithm behind [Mallet](/wiki/mallet), the topic-modeling toolkit from the University of Massachusetts Amherst, and remains the inference engine in many social-science applications because of its simplicity and well-understood behavior.

### Online variational Bayes

For very large corpora, Hoffman, Blei, and Bach introduced an online variational Bayes algorithm at NeurIPS 2010 [5]. The trick is to apply stochastic optimization to the ELBO: each step samples a mini-batch of documents (often a single document), runs the per-document updates to convergence, and then takes a noisy gradient step on the topic-word parameters. The resulting algorithm streams through a corpus in a single pass while still optimizing the variational objective in expectation.

Online LDA is the algorithm behind [Gensim](/wiki/gensim)'s `LdaModel` and is what makes LDA practical for billion-token corpora and streaming text. Hoffman and colleagues later generalized the approach into [stochastic variational inference](/wiki/stochastic_variational_inference) (SVI), published in JMLR in 2013, which extends the same ideas to a broad class of conjugate-exponential-family models [6].
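The corpus-level step of online VB is a weighted blend of the current topic-word parameters with an estimate computed from the mini-batch alone. The sketch below assumes the per-document updates have already produced expected topic-word counts for the batch (`batch_stats`); the function names are hypothetical, and `tau0` and `kappa` follow the step-size schedule $$\rho_t = (\tau_0 + t)^{-\kappa}$$ described by Hoffman, Blei, and Bach [5].

```python
def learning_rate(t, tau0=1.0, kappa=0.7):
    """Step size rho_t = (tau0 + t) ** -kappa; kappa in (0.5, 1] is required for convergence."""
    return (tau0 + t) ** (-kappa)


def online_update(lam, batch_stats, eta, corpus_size, batch_size, t, tau0=1.0, kappa=0.7):
    """Blend the current topic-word parameters with a mini-batch estimate.

    lam[k][v]          current variational Dirichlet parameter for topic k, word v
    batch_stats[k][v]  expected count of word v assigned to topic k in the mini-batch
    eta                symmetric Dirichlet prior on topics (beta in this article)
    """
    rho = learning_rate(t, tau0, kappa)
    scale = corpus_size / batch_size  # treat the batch as if it were the whole corpus
    return [
        [(1.0 - rho) * lam_kv + rho * (eta + scale * s_kv)
         for lam_kv, s_kv in zip(lam_k, stats_k)]
        for lam_k, stats_k in zip(lam, batch_stats)
    ]


lam = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
batch_stats = [[2.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
lam = online_update(lam, batch_stats, eta=0.01, corpus_size=1000, batch_size=2, t=10)
print(lam)
```

Early steps move the parameters a long way toward the batch estimate; later steps make smaller corrections, which is what lets a single pass over a large corpus settle down.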

### Other inference methods

A range of other approximate inference techniques have been applied to LDA, including expectation propagation, particle filters, Riemann manifold Hamiltonian Monte Carlo, and stochastic gradient Langevin dynamics. None has displaced collapsed Gibbs or variational Bayes as the workhorse, but they appear in research papers and specialized libraries.

The table below sketches the trade-offs.

| Algorithm | Original reference | Strengths | Weaknesses |
|---|---|---|---|
| Mean-field variational Bayes | Blei, Ng, Jordan, JMLR 2003 [1] | Fast, deterministic, easy to parallelize | Biased toward peaky modes; requires algebraic derivation per model variant |
| Collapsed Gibbs sampling | Griffiths and Steyvers, PNAS 2004 [4] | Simple to implement; samples from the true posterior in the limit | Can be slow to mix; serial inner loop; harder to scale |
| Collapsed variational Bayes | Teh, Newman, Welling, NeurIPS 2007 [7] | Combines benefits of collapsing and variational methods | More complex derivation |
| Online VB | Hoffman, Blei, Bach, NeurIPS 2010 [5] | Streams over very large corpora; single pass possible | Sensitive to learning-rate schedule |
| Stochastic variational inference | Hoffman, Blei, Wang, Paisley, JMLR 2013 [6] | Generalizes online VB; applies broadly | Same as online VB |

## What are LDA's hyperparameters, and how do you choose the number of topics K?

The model's notation, as used in the Blei, Ng, Jordan paper and most follow-ups, is summarized below.

| Symbol | Meaning | Type | Notes |
|---|---|---|---|
| $$K$$ | Number of topics | Positive integer hyperparameter | Set by the user; not learned in vanilla LDA |
| $$V$$ | Vocabulary size | Positive integer | Determined by preprocessing |
| $$D$$ | Number of documents | Positive integer | Corpus size |
| $$N_d$$ | Length of document d | Positive integer | Bag of words count |
| $$\alpha$$ | Dirichlet prior on document-topic proportions $$\theta$$ | K-vector of positive reals | Symmetric or asymmetric |
| $$\beta$$ ($$\eta$$) | Dirichlet prior on topic-word distributions $$\phi$$ | V-vector of positive reals | Usually symmetric |
| $$\theta_d$$ | Topic distribution for document d | K-vector on the simplex | Latent |
| $$\phi_k$$ | Word distribution for topic k | V-vector on the simplex | Latent |
| $$z_{d,n}$$ | Topic assignment for word position n in document d | Categorical in 1..K | Latent |
| $$w_{d,n}$$ | Observed word at position n in document d | Categorical in 1..V | Observed |

LDA has three knobs that the user picks: K (number of topics), alpha (Dirichlet on theta), and beta (Dirichlet on phi). The Blei et al. paper used a symmetric Dirichlet for both priors. Wallach, Mimno, and McCallum showed in 2009 that an asymmetric prior on $$\theta$$ and a symmetric prior on $$\phi$$ gives better held-out likelihoods on most corpora; that combination is now the default in Mallet and many other toolkits [8].

Griffiths and Steyvers recommended $$\alpha = 50/K$$ and $$\beta = 0.1$$ as reasonable defaults [4]. Both alpha and the topic count K are typically chosen by some combination of held-out [perplexity](/wiki/perplexity), topic coherence (the $$C_v$$ measure of Röder, Both, and Hinneburg, 2015 [9], is the de facto standard), and direct human inspection. Perplexity often disagrees with human judgments. Chang and colleagues, in their 2009 paper "Reading Tea Leaves," introduced the word and topic intrusion tasks and showed that lower-perplexity models can in fact produce less interpretable topics [10]. Most practitioners now use coherence plus eyeballing rather than perplexity alone.
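For reference, perplexity is the exponentiated negative average log-likelihood per token. The sketch below is a plug-in approximation that assumes point estimates of $$\theta$$ and $$\phi$$ are already available for the documents being scored; a proper held-out evaluation also has to infer $$\theta$$ for unseen documents, which this example does not attempt. The function name `perplexity` is an illustrative choice.

```python
import math


def perplexity(docs, theta, phi):
    """exp(-(sum of log p(w)) / (number of tokens)) under point estimates theta, phi."""
    log_lik = 0.0
    n_tokens = 0
    for d, doc in enumerate(docs):
        for w in doc:
            # p(w | d) = sum over topics of theta[d][k] * phi[k][w]
            p_w = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_lik += math.log(p_w)
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)


docs = [[0, 1, 2], [3, 3]]
theta = [[0.5, 0.5], [0.9, 0.1]]
uniform_phi = [[0.25] * 4, [0.25] * 4]
print(perplexity(docs, theta, uniform_phi))
```

A model whose topics are all uniform over a vocabulary of size $$V$$ has perplexity $$V$$, so the toy call above gives a value of 4 up to floating-point rounding; a useful model should score well below the vocabulary size.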

On K specifically, there is no automatic answer. Common practice is to fit several values, plot coherence and perplexity against K, and pick a knee. If you genuinely do not want to specify K, the [hierarchical Dirichlet process](/wiki/hierarchical_dirichlet_process) (HDP) by Teh, Jordan, Beal, and Blei (JASA 2006) is the nonparametric extension that lets the data choose the number of topics [11].

## What are the main extensions of LDA?

A large family of LDA descendants relax one or another of the model's assumptions. A short tour:

| Extension | Authors, year | What it relaxes |
|---|---|---|
| [Correlated Topic Model](/wiki/correlated_topic_model) (CTM) | Blei and Lafferty, 2007 [12] | Replaces the Dirichlet prior on theta with a logistic-normal so topics can be correlated. |
| [Dynamic Topic Model](/wiki/dynamic_topic_model) (DTM) | Blei and Lafferty, 2006 [13] | Lets topics drift over time by chaining topic-word distributions through a state-space model. |
| [Author-Topic Model](/wiki/author_topic_model) | Rosen-Zvi, Griffiths, Steyvers, Smyth, 2004 [14] | Adds an author variable; each author has a Dirichlet over topics. |
| Pachinko Allocation | Li and McCallum, 2006 [15] | Models topic correlations through a directed acyclic graph rather than a single Dirichlet. |
| [Supervised LDA](/wiki/supervised_lda) (sLDA) | McAuliffe and Blei, 2008 [16] | Adds a per-document response variable predicted from the topic mixture. |
| Labeled LDA | Ramage, Hall, Nallapati, Manning, 2009 [17] | Constrains topics to align with given document labels. |
| Hierarchical LDA (hLDA) | Blei, Griffiths, Jordan, 2010 [18] | Topics live on an infinitely deep tree learned via the nested Chinese restaurant process. |
| [HDP](/wiki/hierarchical_dirichlet_process) | Teh, Jordan, Beal, Blei, 2006 [11] | Nonparametric: number of topics is inferred. |
| Structural Topic Model (STM) | Roberts, Stewart, Tingley, et al., 2014 [19] | Lets document-level covariates influence topic prevalence and content. |

Several of these (CTM, DTM, sLDA, STM) became standard tools in their own right, especially in computational social science.

## What software implements LDA?

LDA was an early benchmark for many open-source machine-learning libraries. The implementations differ in inference algorithm, scale, and ergonomics.

| Tool | Language | Inference | Notes |
|---|---|---|---|
| `lda-c` | C | Variational EM | David Blei's original implementation, accompanying the JMLR paper. |
| [Mallet](/wiki/mallet) | Java | Collapsed Gibbs (with hyperparameter optimization) | UMass Amherst toolkit; very popular in digital humanities and social science. |
| [Gensim](/wiki/gensim) | Python | Online VB | Radim Rehurek's library; default LDA tool in modern Python NLP. |
| `scikit-learn` `LatentDirichletAllocation` | Python | Online VB | Wraps the Hoffman, Blei, Bach algorithm; integrates with scikit-learn pipelines. |
| `lda` (R) | R | Collapsed Gibbs | Jonathan Chang's package; widely used in academic R workflows. |
| `stm` (R) | R | Variational EM | Implements the Structural Topic Model (Roberts et al.). |
| `Vowpal Wabbit` | C++ | Online VB | Designed for very large corpora. |
| `LightLDA`, `WarpLDA`, `Familia` | C++ | Distributed Gibbs / VB | Industrial-scale implementations from Microsoft, Tsinghua University, and Baidu, capable of training on hundreds of millions of documents. |
| Stan | Stan / R / Python | Hamiltonian Monte Carlo, ADVI | General-purpose probabilistic programming; LDA written in tens of lines of Stan code. |
| PyMC | Python | NUTS, ADVI | Probabilistic programming alternative; useful for custom LDA variants. |
| Spark MLlib | Scala / Java / Python | EM and online VB | Distributed implementation in Apache Spark; targets large clusters. |
| TensorFlow Probability | Python | Variational and MCMC | LDA included as a tutorial example for tfp.distributions. |

For most projects under, say, a few million documents, Gensim or Mallet is the right starting point. Mallet often produces noticeably better topics out of the box because of its asymmetric prior and its built-in hyperparameter optimization.

## What is LDA used for?

The original LDA paper used Reuters newswire and a corpus of TREC AP articles for evaluation [1]. Within a few years the model had been applied to almost any large discrete corpus people cared about.

- **Document clustering and exploratory text analysis.** The single most common use. Run LDA, look at the topics, look at the documents loading on each topic, write the report.
- **Scientific literature analysis.** Griffiths and Steyvers's 2004 PNAS paper applied LDA to PNAS abstracts from 1991 to 2001 and recovered topics like "genetic" or "surgery" with no supervision [4]. Topic modeling of arXiv, MEDLINE, and other archives became a small industry.
- **Computational social science.** Topic models are a standard preprocessing step in studies of newspaper coverage, political speeches (the U.S. Congressional Record is a perennial favorite), survey free responses, and historical correspondence. Quinn et al.'s 2010 study of Senate speech is a representative example [20].
- **Digital humanities.** Topic models of literary corpora, eighteenth-century pamphlets, and historical newspaper archives are common in the field, where the per-document topic vector is a useful summary for distant reading.
- **Image analysis.** Fei-Fei and Perona's 2005 "Bayesian hierarchical model for natural scene categories" and follow-up work treat quantized SIFT features as visual words and apply LDA to images for unsupervised scene categorization [21].
- **Population genetics.** The Pritchard, Stephens, Donnelly (2000) STRUCTURE model, mathematically equivalent to LDA on allele data, became the standard tool for inferring population structure from genotype data and remains in heavy use; the original paper has more than 30,000 citations [31].
- **Bioinformatics.** Topic models have been applied to protein sequence motifs, genomic feature usage, and microbiome composition data.
- **Recommendation systems.** Variants like collaborative topic regression (Wang and Blei, 2011 [22]) combine LDA over content with matrix factorization over user-item interactions.
- **Audio and speech.** Topic models over acoustic words have been used for music genre identification and speaker analysis.
- **Industrial log analysis and customer review summarization.** A pragmatic, low-glamour use that still shows up in production today.

## How does LDA compare to BERTopic and neural topic models?

For most general English text in 2024 to 2026, neural topic methods are at least competitive with LDA on standard coherence metrics, and often better on perceived topic quality. The shift mirrors the broader [transformer](/wiki/transformer) revolution: dense contextual [word embeddings](/wiki/word_embeddings) capture similarity between words like "cat" and "feline" that bag-of-words LDA cannot.

| Method | Authors, year | Representation | Topic structure | Strengths vs LDA | Weaknesses vs LDA |
|---|---|---|---|---|---|
| LDA | Blei, Ng, Jordan, 2003 [1] | Bag of words; word counts | Topics are categorical distributions over the vocabulary | Probabilistic, well-understood, interpretable, gives a per-document topic vector | Bag-of-words assumption; needs K specified; sensitive to preprocessing |
| ProdLDA / NVDM | Srivastava and Sutton, 2017; Miao et al. 2016 [23][24] | Bag-of-words inputs; neural autoencoder | Latent Dirichlet-like distribution learned via amortized VI | Often higher coherence; trains by gradient descent | Less stable; harder to interpret variational parameters |
| Embedded Topic Model (ETM) | Dieng, Ruiz, Blei, 2020 [25] | Word embeddings + latent topic embeddings | Topics live in the same space as words | Handles rare words well; combines topic and embedding interpretations | More moving parts |
| [Top2Vec](/wiki/top2vec) | Angelov, 2020 [26] | doc2vec or transformer embeddings | Cluster centroids in embedding space | No need to specify K; very interpretable nearest-word topics | Topics are not a probability distribution; comparison to LDA on the same metrics is awkward |
| [BERTopic](/wiki/bertopic) | Grootendorst, 2022 [27] | Transformer (e.g. SBERT) embeddings, UMAP, HDBSCAN, c-TF-IDF | Cluster-derived topics with TF-IDF style word lists | Excellent topic coherence, dynamic topic and class-based extensions, active ecosystem | Pipeline complexity, harder to do principled Bayesian inference, sensitive to embedding choice |

Two points worth keeping in mind. First, BERTopic and Top2Vec are not really probabilistic topic models in the LDA sense: they are clustering algorithms in embedding space dressed up with a per-cluster word list. They generally do not give you a per-document distribution over topics; each document tends to get a single hard cluster assignment, though BERTopic offers soft assignment via cosine similarity on the centroids. If your downstream analysis depends on a clean theta vector for each document (regression on topic prevalence, propensity-style matching, time-series of topic prevalence), LDA-style models are still the more honest tool.

Second, for languages other than English, for short-text social media, and for highly specialized vocabularies, the relative performance of LDA and neural topic methods is more mixed than headline benchmarks suggest.

## What are the limitations of LDA?

The model has several well-known weaknesses, most of which are direct consequences of its assumptions.

- **Bag of words.** Word order is discarded, so any phenomenon that depends on syntax (negation, modification, compound terms) is invisible to the model. Bigram and n-gram extensions exist but introduce sparsity.
- **Fixed K.** Vanilla LDA requires the user to specify the number of topics. Picking K wrong gives either redundant or merged topics. Nonparametric extensions like the [hierarchical Dirichlet process](/wiki/hierarchical_dirichlet_process) avoid this at the cost of a more complex inference algorithm.
- **Preprocessing sensitivity.** Stop-word lists, lemmatization, minimum and maximum document frequency cutoffs, and rare-word thresholds all change the topics that LDA finds. The model is brittle to vocabulary choices in a way that neural embedders are not.
- **Hyperparameter sensitivity.** Topics shift noticeably with alpha and beta. Mallet's asymmetric prior optimization helps; symmetric defaults often produce poorer topics than the asymmetric alternative recommended by Wallach, Mimno, and McCallum [8].
- **Short text.** Tweets, search queries, and product titles violate the bag-of-words mixture assumption because each document has too few tokens to estimate a mixture. Dedicated short-text variants (biterm topic model, sparse topical coding) and embedding-based methods generally do better here.
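- **Polysemy.** LDA can give a word with several senses (for example bank as a financial institution and as a riverbank) high probability in more than one topic, and the per-token assignments separate the senses to some degree. That separation rests only on document-level co-occurrence, though; the model never sees the local context around a token. Contextual embeddings handle this more directly.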
- **Local optima and run-to-run variability.** Both variational EM and Gibbs sampling can land in different topic configurations on different runs. Best practice is to fit several seeds and pick the model with the highest likelihood or coherence.
- **No external knowledge.** LDA does not know that "car" and "automobile" are related. Knowledge-informed extensions (seed words, anchor words, prior topic-word distributions) exist but are bolt-ons rather than first-class features.
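The bag-of-words limitation in the list above is easy to make concrete. In this small illustration (the `bag_of_words` helper and the whitespace tokenization are simplifying assumptions), two sentences with opposite meanings reduce to exactly the same input for LDA:

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens; word order is discarded."""
    return Counter(text.lower().split())


a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")
print(a == b)  # True: the two sentences are indistinguishable to the model
```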

## Is LDA still used in the LLM era?

LDA stopped being the default approach to text analysis somewhere around the time BERT came out in 2018, and the gap widened as sentence-level transformer encoders matured. Most new applied research that wants a general semantic representation now reaches for a [large language model](/wiki/large_language_model) or a sentence embedder rather than LDA. There are still good reasons to prefer LDA in 2026, though, and they are not nostalgic.

The per-document topic vector is the underrated feature. It is a small, interpretable mixture summary of every document, with each entry tied to a named distribution over the vocabulary. That makes it a natural input to downstream regressions, clustering, time-series analysis, and propensity-score matching. LLM embeddings, even when they are higher quality, are dense vectors with no obvious interpretation and no guarantee of stability across model versions.

LDA is also fully reproducible without GPU access once the random seed is fixed, which matters for archival work, low-resource settings, and any analysis that has to be rerun over a long period on the same hardware. And the math is transparent. A graduate student can derive the variational updates or the Gibbs sampler from scratch in an afternoon, then run the model on a laptop. Few neural alternatives clear that bar.

For corpora of short, single-paragraph documents and for topic discovery on very small datasets, LDA tends to perform poorly: each document supplies few word co-occurrences, and the bag-of-words assumption strips away the word order that could compensate. There it is fair to reach for an embedding-based method. For long-form documents in any field where the topic structure itself is the object of study, the model still earns its keep.

It is also worth noting that the methodological lineage running from LDA into modern generative AI is direct. The variational EM machinery developed for LDA generalized into stochastic variational inference, then into amortized variational inference and the [variational autoencoder](/wiki/variational_autoencoder) (Kingma and Welling, 2014). Many of the same researchers (Blei, Hoffman, Mohamed, Welling, Kingma) were involved in both lines. LDA was the proving ground for an entire family of probabilistic methods that the field still uses, even when the headline applications have moved on.

The paper has been recognized as a landmark in machine learning. David Blei received the ACM Infosys Foundation Award (now the ACM Prize in Computing) in 2013 for his work on topic modeling, was elected an [ACM](/wiki/acm) Fellow in 2015 with a citation that specifically references probabilistic topic modeling, and has been elected to the [American Academy of Arts and Sciences](/wiki/american_academy_of_arts_and_sciences) (2022) and as a Fellow of the Royal Statistical Society. Andrew Ng and Michael Jordan are also widely recognized for adjacent work; Jordan, in particular, is a National Academy of Sciences member whose Bayesian and graphical-model contributions LDA built on. The Blei, Ng, Jordan paper itself is consistently among the top-cited papers in JMLR's history.

## See also

- [Topic model](/wiki/topic_model)
- [Latent semantic analysis](/wiki/latent_semantic_analysis)
- [Probabilistic latent semantic indexing](/wiki/probabilistic_latent_semantic_indexing)
- [Variational inference](/wiki/variational_inference)
- [Markov chain Monte Carlo](/wiki/mcmc)
- [Gibbs sampling](/wiki/gibbs_sampling)
- [Expectation-maximization](/wiki/expectation_maximization)
- [Bayesian inference](/wiki/bayesian_inference)
- [Dirichlet distribution](/wiki/dirichlet_distribution)
- [Hierarchical Dirichlet process](/wiki/hierarchical_dirichlet_process)
- [BERTopic](/wiki/bertopic)
- [Top2Vec](/wiki/top2vec)
- [Natural language processing](/wiki/natural_language_processing)

## References

1. Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). "Latent Dirichlet Allocation." *Journal of Machine Learning Research*, 3: 993 to 1022. (Earlier version at NIPS 2002.)
2. Hofmann, T. (1999). "Probabilistic Latent Semantic Indexing." *Proceedings of the 22nd ACM SIGIR Conference on Research and Development in Information Retrieval*, 50 to 57.
3. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). "Indexing by Latent Semantic Analysis." *Journal of the American Society for Information Science*, 41(6): 391 to 407.
4. Griffiths, T. L. and Steyvers, M. (2004). "Finding Scientific Topics." *Proceedings of the National Academy of Sciences*, 101(suppl 1): 5228 to 5235.
5. Hoffman, M., Blei, D. M., and Bach, F. (2010). "Online Learning for Latent Dirichlet Allocation." *Advances in Neural Information Processing Systems* 23: 856 to 864.
6. Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). "Stochastic Variational Inference." *Journal of Machine Learning Research*, 14(1): 1303 to 1347.
7. Teh, Y. W., Newman, D., and Welling, M. (2007). "A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation." *Advances in Neural Information Processing Systems* 19: 1353 to 1360.
8. Wallach, H. M., Mimno, D., and McCallum, A. (2009). "Rethinking LDA: Why Priors Matter." *Advances in Neural Information Processing Systems* 22: 1973 to 1981.
9. Röder, M., Both, A., and Hinneburg, A. (2015). "Exploring the Space of Topic Coherence Measures." *Proceedings of the 8th ACM International Conference on Web Search and Data Mining*, 399 to 408.
10. Chang, J., Boyd-Graber, J., Gerrish, S., Wang, C., and Blei, D. M. (2009). "Reading Tea Leaves: How Humans Interpret Topic Models." *Advances in Neural Information Processing Systems* 22: 288 to 296.
11. Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). "Hierarchical Dirichlet Processes." *Journal of the American Statistical Association*, 101(476): 1566 to 1581.
12. Blei, D. M. and Lafferty, J. D. (2007). "A Correlated Topic Model of Science." *Annals of Applied Statistics*, 1(1): 17 to 35.
13. Blei, D. M. and Lafferty, J. D. (2006). "Dynamic Topic Models." *Proceedings of the 23rd International Conference on Machine Learning*, 113 to 120.
14. Rosen-Zvi, M., Griffiths, T., Steyvers, M., and Smyth, P. (2004). "The Author-Topic Model for Authors and Documents." *Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence*, 487 to 494.
15. Li, W. and McCallum, A. (2006). "Pachinko Allocation: DAG-Structured Mixture Models of Topic Correlations." *Proceedings of the 23rd International Conference on Machine Learning*, 577 to 584.
16. McAuliffe, J. D. and Blei, D. M. (2008). "Supervised Topic Models." *Advances in Neural Information Processing Systems* 20: 121 to 128.
17. Ramage, D., Hall, D., Nallapati, R., and Manning, C. D. (2009). "Labeled LDA: A Supervised Topic Model for Credit Attribution in Multi-Labeled Corpora." *Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing*, 248 to 256.
18. Blei, D. M., Griffiths, T. L., and Jordan, M. I. (2010). "The Nested Chinese Restaurant Process and Bayesian Nonparametric Inference of Topic Hierarchies." *Journal of the ACM*, 57(2): 7.
19. Roberts, M. E., Stewart, B. M., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S. K., Albertson, B., and Rand, D. G. (2014). "Structural Topic Models for Open-Ended Survey Responses." *American Journal of Political Science*, 58(4): 1064 to 1082.
20. Quinn, K. M., Monroe, B. L., Colaresi, M., Crespin, M. H., and Radev, D. R. (2010). "How to Analyze Political Attention with Minimal Assumptions and Costs." *American Journal of Political Science*, 54(1): 209 to 228.
21. Fei-Fei, L. and Perona, P. (2005). "A Bayesian Hierarchical Model for Learning Natural Scene Categories." *IEEE Conference on Computer Vision and Pattern Recognition*, 524 to 531.
22. Wang, C. and Blei, D. M. (2011). "Collaborative Topic Modeling for Recommending Scientific Articles." *Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, 448 to 456.
23. Srivastava, A. and Sutton, C. (2017). "Autoencoding Variational Inference for Topic Models." *International Conference on Learning Representations*.
24. Miao, Y., Yu, L., and Blunsom, P. (2016). "Neural Variational Inference for Text Processing." *International Conference on Machine Learning*, 1727 to 1736.
25. Dieng, A. B., Ruiz, F. J. R., and Blei, D. M. (2020). "Topic Modeling in Embedding Spaces." *Transactions of the Association for Computational Linguistics*, 8: 439 to 453.
26. Angelov, D. (2020). "Top2Vec: Distributed Representations of Topics." arXiv:2008.09470.
27. Grootendorst, M. (2022). "BERTopic: Neural Topic Modeling with a Class-Based TF-IDF Procedure." arXiv:2203.05794.
28. Murphy, K. P. (2012). *Machine Learning: A Probabilistic Perspective*. MIT Press, chapter 27 (Latent variable models for discrete data).
29. Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer, chapter 10 (Approximate Inference) and chapter 13 (Sequential Data).
30. Blei, D. M. (2012). "Probabilistic Topic Models." *Communications of the ACM*, 55(4): 77 to 84.
31. Pritchard, J. K., Stephens, M., and Donnelly, P. (2000). "Inference of Population Structure Using Multilocus Genotype Data." *Genetics*, 155(2): 945 to 959.