Latent Dirichlet allocation
Last reviewed
May 2, 2026
Sources
31 citations
Review status
Source-backed
Revision
v2 · 5,685 words
Latent Dirichlet Allocation (LDA) is a generative probabilistic model for collections of discrete data, most often text. It treats each document as a mixture of a small number of latent topics, and each topic as a distribution over the vocabulary. Given a corpus, LDA infers both the topic mixture for every document and the word distribution for every topic without using any labels. The model was introduced by David Blei, Andrew Ng, and Michael Jordan in a paper presented at NIPS 2002 and published in the Journal of Machine Learning Research in January 2003 [1]. That paper became one of the most cited works in machine learning, with well over 50,000 citations on Google Scholar, and for roughly a decade LDA was the default tool for topic modeling on text corpora. The method is not specific to language. Anything that can be represented as a bag of discrete tokens, including images quantized into visual words, audio segments, log lines, or genetic sequences, can be modeled with LDA.
LDA is best understood as a Bayesian extension of probabilistic Latent Semantic Indexing (pLSI) due to Thomas Hofmann (1999) [2], which itself was a probabilistic reading of Latent Semantic Analysis by Deerwester and colleagues (1990) [3]. Where pLSI directly fits per-document topic proportions and offers no clean way to handle a new document, LDA places a Dirichlet prior over those proportions, turning the per-document mixture into a latent random variable. That single change makes the model fully generative, gives the parameter count a fixed ceiling that does not grow with the corpus, and lets the model assign topic distributions to documents it has never seen.
In the deep-learning era, dense neural representations have replaced LDA for many practical tasks. Word embeddings and contextual encoders from transformers capture semantics that bag-of-words models miss, and clustering pipelines like BERTopic and Top2Vec often produce more coherent topics on contemporary corpora. LDA is still in active use, though, especially in social science, legal, and humanities work where reproducibility, simple interpretability, and the per-document topic vector are exactly what the analyst wants. It also remains a standard teaching example for variational inference, Gibbs sampling, and Bayesian graphical models.
| Field | Value |
|---|---|
| Type | Generative probabilistic Bayesian model |
| Domain | Topic modeling on bag-of-words data |
| Authors | David Blei, Andrew Ng, Michael I. Jordan |
| Introduced | NIPS 2002; published JMLR vol. 3, January 2003 |
| Citations (Google Scholar, 2024) | More than 50,000 |
| Inference engines | Variational Bayes, collapsed Gibbs sampling, online VB, stochastic variational inference |
| Predecessor | Probabilistic LSI (Hofmann, 1999) |
| Independent precursor | Pritchard, Stephens, Donnelly model in genetics (2000) |
| Reference implementation | David Blei's lda-c |
| Standard libraries | Mallet, Gensim, scikit-learn, Stan, PyMC, Spark MLlib |
| Common modern replacements | BERTopic, Top2Vec, Embedded Topic Model |
Imagine the news. A typical article isn't only about one thing. A piece about a new climate bill is partly about politics, partly about energy, and partly about economics. LDA tries to discover the small set of underlying themes (call them topics) running through a whole pile of articles, without anyone telling it what those themes are.
It does this by pretending each article was written by a strange but methodical author. For every word, the author first rolls a weighted die to pick a topic, then rolls a different die (one belonging to that topic) to pick a word. Topics that come up more often in the article reflect what the article is mostly about. The model then runs that story backwards: given the words you actually see, what set of topics, and what mix per article, would best explain them? After enough passes, the words cluster into coherent groups (one bucket of words about politics, another about basketball, another about cooking) and each article gets a fingerprint showing how much of each topic it contains.
No grammar, no meaning, no sentence structure. Just word counts and probability. It works much better than it has any right to.
The lineage that LDA caps off goes back to information retrieval in the late 1980s. Latent Semantic Analysis (also called LSI in retrieval contexts), proposed by Deerwester, Dumais, Furnas, Landauer, and Harshman in 1990, builds a term-document matrix and applies truncated singular value decomposition to project documents into a lower-dimensional semantic space [3]. LSA captures co-occurrence structure and gives a useful similarity measure between documents, but it is a linear-algebraic technique, not a probabilistic model. It has no notion of a topic as a probability distribution, and the projected coordinates can be negative, which makes them awkward to interpret.
Thomas Hofmann's pLSI (1999) reframed LSA as a probabilistic mixture model [2]. In pLSI, every word in a document is assumed to be drawn from one of K topic distributions, and each document has its own mixing weights over those topics. Maximum likelihood estimation via expectation-maximization then learns both the topics and the per-document weights. pLSI gave the field a clean probabilistic semantics for topic modeling, but Blei, Ng, and Jordan pointed out two structural weaknesses [1]: the number of parameters grows linearly with the number of training documents, which invites overfitting, and the model has no natural way to assign probability to a document that was not in the training set.
LDA fixes both problems by adding a Dirichlet prior over the per-document topic proportions and (in the smoothed version most people use) over the topic-word distributions as well. The Dirichlet is the conjugate prior of the categorical and multinomial distributions, which keeps the math tractable. The result is a fully generative, fully Bayesian alternative to pLSI with a fixed number of hyperparameters.
For full historical accuracy, the population-genetics community arrived at a structurally similar model three years before LDA. Pritchard, Stephens, and Donnelly (2000), in their paper introducing the STRUCTURE program, modeled each individual's genome as a mixture of K ancestral populations, with each population modeled as a vector of allele frequencies [31]. The fitted mixture proportions describe how much of an individual's genome comes from each ancestral population. Translated into the LDA vocabulary, individuals correspond to documents, ancestral populations to topics, alleles to words, and the inference target is identical: a per-document mixture over latent groups together with per-group distributions over the observation vocabulary. Blei et al. cited the Pritchard work in later expositions; the two lineages converged on the same Bayesian admixture model from different application domains.
Let K be the number of topics, V the vocabulary size, D the number of documents, and N_d the length of document d. The smoothed LDA model has two Dirichlet hyperparameters, alpha (a K-vector) governing document-topic proportions, and beta or eta (a V-vector) governing topic-word distributions. The generative story, as written in the original paper, is [1]:
for each topic k = 1..K:
    phi_k ~ Dirichlet(beta)                     # word distribution for topic k
for each document d = 1..D:
    theta_d ~ Dirichlet(alpha)                  # topic distribution for document d
    for each word position n = 1..N_d:
        z_{d,n} ~ Categorical(theta_d)          # pick a topic
        w_{d,n} ~ Categorical(phi_{z_{d,n}})    # pick a word from that topic
In plate notation, three plates capture the structure: an outer plate of size K for topics (containing phi), an outer plate of size D for documents (containing theta), and an inner plate of size N_d for word positions inside each document (containing the latent topic z and the observed word w). Only the words w_{d,n} are observed; theta, phi, and the topic assignments z are all latent.
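The generative story can be run forward to produce a synthetic corpus, which is a common way to sanity-check an inference routine. The sketch below is a minimal illustration using only the Python standard library. The helper names (`sample_dirichlet`, `sample_categorical`, `generate_corpus`) and the use of symmetric scalar values for alpha and beta are assumptions made for the example, not something taken from the original paper.

```python
import random

def sample_dirichlet(concentration, rng):
    """Draw one point on the simplex by normalizing independent Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [x / total for x in draws]

def sample_categorical(probs, rng):
    """Draw an index in range(len(probs)) with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(K, V, doc_lengths, alpha, beta, seed=0):
    """Sample (phi, theta, z, w) by following the generative story step by step.

    alpha and beta are symmetric scalar concentrations (an assumption of this sketch).
    """
    rng = random.Random(seed)
    # One word distribution per topic.
    phi = [sample_dirichlet([beta] * V, rng) for _ in range(K)]
    theta, z, w = [], [], []
    for n_d in doc_lengths:
        theta_d = sample_dirichlet([alpha] * K, rng)                   # topic mix for this document
        z_d = [sample_categorical(theta_d, rng) for _ in range(n_d)]   # a topic per word position
        w_d = [sample_categorical(phi[k], rng) for k in z_d]           # a word from that topic
        theta.append(theta_d)
        z.append(z_d)
        w.append(w_d)
    return phi, theta, z, w

phi, theta, z, w = generate_corpus(K=3, V=8, doc_lengths=[10, 12, 6], alpha=0.5, beta=0.1)
```

Only `w` would be visible to an inference algorithm; `phi`, `theta`, and `z` are the latent quantities it has to recover.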
A few features of this construction are worth pulling out. The model treats words as exchangeable inside a document (a bag-of-words assumption), and treats documents as exchangeable inside the corpus. There is no syntax, no word order, no notion of sentence boundary. By de Finetti's theorem, the exchangeability assumption corresponds exactly to the latent-mixture form Blei et al. write down. The Dirichlet prior on theta has a clean interpretation: small alpha values (less than 1) push each document toward using only a few topics; large alpha values spread mass across many topics. The same logic applies to beta on the word side. In practice analysts often use symmetric Dirichlet priors with small alpha, since real documents tend to be about a small handful of things at a time.
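The effect of the concentration parameter is easy to check numerically. The sketch below, under the assumption of a symmetric Dirichlet over K = 10 topics, estimates the average weight of the single largest topic for a small and a large alpha. The function name `mean_largest_share` and the specific values are illustrative choices.

```python
import random

def sample_dirichlet(concentration, rng):
    """Draw one point on the simplex by normalizing independent Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [x / total for x in draws]

def mean_largest_share(alpha, K=10, draws=500, seed=0):
    """Average weight of the largest component under a symmetric Dirichlet(alpha)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        total += max(sample_dirichlet([alpha] * K, rng))
    return total / draws

sparse = mean_largest_share(alpha=0.1)   # small alpha: one topic tends to dominate
spread = mean_largest_share(alpha=10.0)  # large alpha: weights sit closer to 1/K
```

With the small alpha the largest topic takes most of the mass on average, while with the large alpha it stays close to the uniform share of 1/K.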
The joint distribution induced by the model factorizes as
p(phi, theta, z, w | alpha, beta)
= prod_k p(phi_k | beta)
* prod_d p(theta_d | alpha)
* prod_d prod_n p(z_{d,n} | theta_d) p(w_{d,n} | phi_{z_{d,n}}).
The quantity practitioners want is the posterior p(phi, theta, z | w, alpha, beta). Computing it exactly requires marginalizing over the discrete topic assignments and the continuous theta and phi simultaneously. That marginal sum has K^N terms (where N is the total number of word tokens in the corpus) and is intractable for any realistic dataset [1].
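To see where the blow-up comes from, it can help to write out the marginal likelihood of a single document with the topics phi held fixed. This is a worked restatement in the notation above, not an additional result:

```latex
p(w_d \mid \alpha, \phi)
  = \int p(\theta_d \mid \alpha)
    \prod_{n=1}^{N_d} \sum_{k=1}^{K} \theta_{d,k}\, \phi_{k,\, w_{d,n}}
    \; d\theta_d
```

Expanding the product of sums gives K^{N_d} terms for one document, and the integral over theta_d couples all of them, so the sum cannot be pushed inside and evaluated word by word.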
Because the posterior is intractable, every practical use of LDA depends on an approximate inference algorithm. Four families have dominated.
The original Blei, Ng, and Jordan paper introduced a mean-field variational inference algorithm [1]. The method posits a fully factorized variational distribution
q(theta, z | gamma, phi) = prod_d q(theta_d | gamma_d) prod_n q(z_{d,n} | phi_{d,n})
in which theta_d is approximated by a Dirichlet with free parameter gamma_d, and each topic assignment z_{d,n} is approximated by an independent categorical with free parameter phi_{d,n}. (The notation collides with the topic-word phi above; in the paper the per-word variational categorical is also called phi.) The algorithm then maximizes the evidence lower bound (ELBO) by coordinate ascent, yielding fixed-point updates that look like soft EM steps. Each pass over a document updates phi_{d,n} for each word and gamma_d for the document; corpus-level updates handle the topic-word distributions in the smoothed version of the model.
Variational EM gives a deterministic, reasonably fast algorithm that scales linearly in the corpus size and is easy to parallelize over documents. The price is a bias: the mean-field approximation underestimates posterior variance and tends to find peakier topic distributions than fully Bayesian methods would. For most practical purposes the bias is acceptable.
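The per-document part of the coordinate ascent can be sketched in a few lines. The version below is a simplified illustration that holds the topic-word probabilities fixed and assumes a symmetric scalar alpha. The names `digamma`, `fit_document`, and `resp` (standing in for the per-word variational phi_{d,n}, to avoid the notation clash) are choices made for this example. The digamma function is approximated with a recurrence and a truncated asymptotic series so that the block needs only the standard library.

```python
import math

def digamma(x):
    """Approximate digamma for x > 0 via psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # psi(x) ~ ln x - 1/(2x) - 1/(12 x^2) + 1/(120 x^4) - 1/(252 x^6)
    return result + math.log(x) - 0.5 * inv - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def fit_document(word_ids, topic_word, alpha, iters=50):
    """Mean-field updates for one document.

    topic_word[k][v] is the probability of word v under topic k (held fixed here).
    Returns gamma (Dirichlet parameters for theta_d) and resp (per-word topic responsibilities).
    """
    K, N = len(topic_word), len(word_ids)
    gamma = [alpha + N / K] * K
    resp = [[1.0 / K] * K for _ in range(N)]
    for _ in range(iters):
        weights = [math.exp(digamma(g)) for g in gamma]
        for n, v in enumerate(word_ids):
            # resp[n][k] is proportional to topic_word[k][v] * exp(digamma(gamma[k]))
            row = [topic_word[k][v] * weights[k] for k in range(K)]
            norm = sum(row)
            resp[n] = [r / norm for r in row]
        # gamma[k] = alpha + expected number of words assigned to topic k
        gamma = [alpha + sum(resp[n][k] for n in range(N)) for k in range(K)]
    return gamma, resp

topic_word = [[0.45, 0.45, 0.05, 0.05],
              [0.05, 0.05, 0.45, 0.45]]
gamma, resp = fit_document([0, 1, 0, 1, 0, 2], topic_word, alpha=0.1)
```

A full variational EM implementation would wrap this in an outer loop that re-estimates the topic-word parameters from the responsibilities of every document.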
Griffiths and Steyvers, working independently, introduced a Gibbs sampling algorithm for LDA that became enormously popular after their 2004 PNAS paper "Finding Scientific Topics" [4]. Their key trick was to integrate out theta and phi analytically, exploiting Dirichlet-multinomial conjugacy, so that the only variables left to sample are the discrete topic assignments z_{d,n}. This is called collapsed Gibbs sampling.
The sampler iterates over every word token in the corpus and resamples its topic assignment from the conditional
p(z_{d,n} = k | z_{-(d,n)}, w) ∝ (n^{-(d,n)}_{d,k} + alpha_k)
* (n^{-(d,n)}_{k,w_{d,n}} + beta) / (n^{-(d,n)}_{k,.} + V*beta)
where n_{d,k} is the count of words in document d assigned to topic k, n_{k,v} is the count of times word v has been assigned to topic k corpus-wide, and the superscript indicates the count is taken excluding the current token. After a burn-in period the samples can be averaged to estimate theta and phi. Each pass over the corpus is computationally similar to a pass of variational EM, but the algorithm is exceptionally simple to implement (the inner loop is essentially counts and a normalized vector) and it converges to the true posterior in the limit, which variational methods do not.
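The conditional above translates almost directly into code. The following is a minimal standard-library sketch, assuming symmetric scalar alpha and beta and documents given as lists of integer word ids. The function name `collapsed_gibbs` and its defaults are illustrative, and for brevity it reads theta and phi off the final sample instead of averaging over many post-burn-in samples.

```python
import random

def collapsed_gibbs(docs, K, V, alpha=0.1, beta=0.01, sweeps=200, seed=0):
    """Collapsed Gibbs sampling for LDA on docs (lists of word ids in range(V))."""
    rng = random.Random(seed)
    n_dk = [[0] * K for _ in docs]      # words in document d assigned to topic k
    n_kv = [[0] * V for _ in range(K)]  # times word v is assigned to topic k
    n_k = [0] * K                       # total words assigned to topic k
    z = []
    # Random initial assignment of every token to a topic.
    for d, doc in enumerate(docs):
        z_d = []
        for v in doc:
            k = rng.randrange(K)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kv[k][v] += 1
            n_k[k] += 1
        z.append(z_d)
    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for n, v in enumerate(doc):
                # Remove the current token from the counts.
                k = z[d][n]
                n_dk[d][k] -= 1
                n_kv[k][v] -= 1
                n_k[k] -= 1
                # Unnormalized conditional for each candidate topic.
                weights = [(n_dk[d][j] + alpha) * (n_kv[j][v] + beta) / (n_k[j] + V * beta)
                           for j in range(K)]
                k = rng.choices(range(K), weights=weights, k=1)[0]
                # Add the token back under its new topic.
                z[d][n] = k
                n_dk[d][k] += 1
                n_kv[k][v] += 1
                n_k[k] += 1
    # Point estimates from the final state of the chain.
    theta = [[(n_dk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
             for d, doc in enumerate(docs)]
    phi = [[(n_kv[k][v] + beta) / (n_k[k] + V * beta) for v in range(V)]
           for k in range(K)]
    return theta, phi, z

docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 4], [0, 1, 1, 0, 3]]
theta, phi, z = collapsed_gibbs(docs, K=2, V=5)
```

The inner loop touches only integer counts and one length-K weight vector, which is why the sampler is so easy to implement and also why its serial token-by-token structure is harder to parallelize.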
Collapsed Gibbs sampling is the algorithm behind Mallet, the topic-modeling toolkit from the University of Massachusetts Amherst, and remains the inference engine in many social-science applications because of its simplicity and well-understood behavior.
For very large corpora, Hoffman, Blei, and Bach introduced an online variational Bayes algorithm at NeurIPS 2010 [5]. The trick is to apply stochastic optimization to the ELBO: each step samples a mini-batch of documents (often a single document), runs the per-document updates to convergence, and then takes a noisy gradient step on the topic-word parameters. The resulting algorithm streams through a corpus in a single pass while still optimizing the variational objective in expectation.
Online LDA is the algorithm behind Gensim's LdaModel and is what makes LDA practical for billion-token corpora and streaming text. Hoffman and colleagues later generalized the approach into stochastic variational inference (SVI), published in JMLR in 2013, which extends the same ideas to a broad class of conjugate-exponential-family models [6].
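The corpus-level step of online variational Bayes can be illustrated separately from the per-document updates. The sketch below assumes the usual decaying step-size schedule rho_t = (tau0 + t)^(-kappa) and a blend between the current topic-word parameters and the estimate implied by one mini-batch. The names `step_size`, `online_update`, `lam`, and `minibatch_stats`, and the default values of `tau0` and `kappa`, are assumptions for illustration and are not stated in the text above.

```python
def step_size(t, tau0=1.0, kappa=0.7):
    """Decaying step size rho_t = (tau0 + t) ** -kappa; kappa in (0.5, 1] is the usual range."""
    return (tau0 + t) ** (-kappa)

def online_update(lam, minibatch_stats, eta, corpus_size, batch_size, t):
    """One noisy step on the topic-word variational parameters.

    lam[k][v]             current parameter for word v in topic k
    minibatch_stats[k][v] expected count of word v assigned to topic k in this mini-batch
    eta                   symmetric Dirichlet prior on topic-word distributions
    """
    rho = step_size(t)
    scale = corpus_size / batch_size  # pretend the whole corpus looks like this mini-batch
    return [[(1.0 - rho) * lam[k][v] + rho * (eta + scale * minibatch_stats[k][v])
             for v in range(len(lam[k]))]
            for k in range(len(lam))]

lam = [[1.0, 1.0]]
stats = [[2.0, 0.0]]
new_lam = online_update(lam, stats, eta=0.01, corpus_size=100, batch_size=10, t=0)
```

Early steps move the parameters a long way toward the mini-batch estimate, and later steps move them less, which is what makes the schedule sensitive to tuning.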
A range of other approximate inference techniques have been applied to LDA, including expectation propagation, particle filters, Riemann manifold Hamiltonian Monte Carlo, and stochastic gradient Langevin dynamics. None has displaced collapsed Gibbs or variational Bayes as the workhorse, but they appear in research papers and specialized libraries.
The table below sketches the trade-offs.
| Algorithm | Original reference | Strengths | Weaknesses |
|---|---|---|---|
| Mean-field variational Bayes | Blei, Ng, Jordan, JMLR 2003 [1] | Fast, deterministic, easy to parallelize | Biased toward peaky modes; requires algebraic derivation per model variant |
| Collapsed Gibbs sampling | Griffiths and Steyvers, PNAS 2004 [4] | Simple to implement; samples from the true posterior in the limit | Can be slow to mix; serial inner loop; harder to scale |
| Collapsed variational Bayes | Teh, Newman, Welling, NeurIPS 2007 [7] | Combines benefits of collapsing and variational methods | More complex derivation |
| Online VB | Hoffman, Blei, Bach, NeurIPS 2010 [5] | Streams over very large corpora; single pass possible | Sensitive to learning-rate schedule |
| Stochastic variational inference | Hoffman, Blei, Wang, Paisley, JMLR 2013 [6] | Generalizes online VB; applies broadly | Same as online VB |
The model's notation, as used in the Blei, Ng, Jordan paper and most follow-ups, is summarized below.
| Symbol | Meaning | Type | Notes |
|---|---|---|---|
| K | Number of topics | Positive integer hyperparameter | Set by the user; not learned in vanilla LDA |
| V | Vocabulary size | Positive integer | Determined by preprocessing |
| D | Number of documents | Positive integer | Corpus size |
| N_d | Length of document d | Positive integer | Bag of words count |
| alpha | Dirichlet prior on document-topic proportions theta | K-vector of positive reals | Symmetric or asymmetric |
| beta (eta) | Dirichlet prior on topic-word distributions phi | V-vector of positive reals | Usually symmetric |
| theta_d | Topic distribution for document d | K-vector on the simplex | Latent |
| phi_k | Word distribution for topic k | V-vector on the simplex | Latent |
| z_{d,n} | Topic assignment for word position n in document d | Categorical in 1..K | Latent |
| w_{d,n} | Observed word at position n in document d | Categorical in 1..V | Observed |
LDA has three knobs that the user picks: K (number of topics), alpha (Dirichlet on theta), and beta (Dirichlet on phi). The Blei et al. paper used a symmetric Dirichlet for both priors. Wallach, Mimno, and McCallum showed in 2009 that an asymmetric prior on theta and a symmetric prior on phi gives better held-out likelihoods on most corpora; that combination is now the default in Mallet and many other toolkits [8].
Griffiths and Steyvers recommended alpha = 50/K and beta = 0.1 as reasonable defaults [4]. Both alpha and the topic count K are typically chosen by some combination of held-out perplexity, topic coherence (the C_v measure of Röder, Both, and Hinneburg, 2015 [9], is the de facto standard), and direct human inspection. Perplexity often disagrees with human judgments. Chang and colleagues, in their 2009 paper "Reading Tea Leaves," introduced the word and topic intrusion tasks and showed that lower-perplexity models can in fact produce less interpretable topics [10]. Most practitioners now use coherence plus eyeballing rather than perplexity alone.
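For reference, perplexity is the exponential of the negative average per-token log-likelihood. The sketch below computes it from point estimates of theta and phi. It is a simplification: a proper held-out evaluation would first infer theta for the unseen documents, and the function name `perplexity` and the toy inputs are assumptions for the example.

```python
import math

def perplexity(docs, theta, phi):
    """exp(-average log-likelihood per token) under point estimates theta and phi.

    docs[d] is a list of word ids, theta[d][k] a topic weight, phi[k][v] a word probability.
    """
    log_lik, n_tokens = 0.0, 0
    for d, doc in enumerate(docs):
        for v in doc:
            # Token probability marginalizes over the topic assignment.
            p = sum(theta[d][k] * phi[k][v] for k in range(len(phi)))
            log_lik += math.log(p)
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)

# Two topics that are both uniform over a four-word vocabulary:
# every token has probability 0.25, so the perplexity equals the vocabulary size.
docs = [[0, 1, 2, 3]]
theta = [[0.5, 0.5]]
phi = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
value = perplexity(docs, theta, phi)
```

Lower values mean the model finds the text less surprising, which, as noted above, does not guarantee that the topics read as more coherent to a human.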
On K specifically, there is no automatic answer. Common practice is to fit several values, plot coherence and perplexity against K, and pick a knee. If you genuinely do not want to specify K, the hierarchical Dirichlet process (HDP) by Teh, Jordan, Beal, and Blei (JASA 2006) is the nonparametric extension that lets the data choose the number of topics [11].
A large family of LDA descendants relax one or another of the model's assumptions. A short tour:
| Extension | Authors, year | What it relaxes |
|---|---|---|
| Correlated Topic Model (CTM) | Blei and Lafferty, 2007 [12] | Replaces the Dirichlet prior on theta with a logistic-normal so topics can be correlated. |
| Dynamic Topic Model (DTM) | Blei and Lafferty, 2006 [13] | Lets topics drift over time by chaining topic-word distributions through a state-space model. |
| Author-Topic Model | Rosen-Zvi, Griffiths, Steyvers, Smyth, 2004 [14] | Adds an author variable; each author has a Dirichlet over topics. |
| Pachinko Allocation | Li and McCallum, 2006 [15] | Models topic correlations through a directed acyclic graph rather than a single Dirichlet. |
| Supervised LDA (sLDA) | McAuliffe and Blei, 2008 [16] | Adds a per-document response variable predicted from the topic mixture. |
| Labeled LDA | Ramage, Hall, Nallapati, Manning, 2009 [17] | Constrains topics to align with given document labels. |
| Hierarchical LDA (hLDA) | Blei, Griffiths, Jordan, 2010 [18] | Topics live on an infinitely deep tree learned via the nested Chinese restaurant process. |
| HDP | Teh, Jordan, Beal, Blei, 2006 [11] | Nonparametric: number of topics is inferred. |
| Structural Topic Model (STM) | Roberts, Stewart, Tingley, et al., 2014 [19] | Lets document-level covariates influence topic prevalence and content. |
Several of these (CTM, DTM, sLDA, STM) became standard tools in their own right, especially in computational social science.
LDA was an early benchmark for many open-source machine-learning libraries. The implementations differ in inference algorithm, scale, and ergonomics.
| Tool | Language | Inference | Notes |
|---|---|---|---|
| lda-c | C | Variational EM | David Blei's original implementation, accompanying the JMLR paper. |
| Mallet | Java | Collapsed Gibbs (with hyperparameter optimization) | UMass Amherst toolkit; very popular in digital humanities and social science. |
| Gensim | Python | Online VB | Radim Řehůřek's library; default LDA tool in modern Python NLP. |
| scikit-learn LatentDirichletAllocation | Python | Online VB | Wraps the Hoffman, Blei, Bach algorithm; integrates with scikit-learn pipelines. |
| lda (R) | R | Collapsed Gibbs | Jonathan Chang's package; widely used in academic R workflows. |
| stm (R) | R | Variational EM | Implements the Structural Topic Model (Roberts et al.). |
| Vowpal Wabbit | C++ | Online VB | Designed for very large corpora. |
| LightLDA, WarpLDA, Familia | C++ | Distributed Gibbs / VB | Industrial-scale implementations from Microsoft, THU, and Baidu, capable of training on hundreds of millions of documents. |
| Stan | Stan / R / Python | Hamiltonian Monte Carlo, ADVI | General-purpose probabilistic programming; LDA written in tens of lines of Stan code. |
| PyMC | Python | NUTS, ADVI | Probabilistic programming alternative; useful for custom LDA variants. |
| Spark MLlib | Scala / Java / Python | EM and online VB | Distributed implementation in Apache Spark; targets large clusters. |
| TensorFlow Probability | Python | Variational and MCMC | LDA included as a tutorial example for tfp.distributions. |
For most projects under, say, a few million documents, Gensim or Mallet is the right starting point. Mallet often produces noticeably better topics out of the box because of its asymmetric prior and its built-in hyperparameter optimization.
The original LDA paper used Reuters newswire and a corpus of TREC AP articles for evaluation [1]. Within a few years the model had been applied to almost any large discrete corpus people cared about.
For most general English text in 2024 to 2026, neural topic methods are at least competitive with LDA on standard coherence metrics, and often better on perceived topic quality. The shift mirrors the broader transformer revolution: dense contextual word embeddings capture similarity between words like "cat" and "feline" that bag-of-words LDA cannot.
| Method | Reference | Representation | Topic structure | Strengths vs LDA | Weaknesses vs LDA |
|---|---|---|---|---|---|
| LDA | Blei, Ng, Jordan, 2003 [1] | Bag of words; word counts | Topics are categorical distributions over the vocabulary | Probabilistic, well-understood, interpretable, gives a per-document topic vector | Bag-of-words assumption; needs K specified; sensitive to preprocessing |
| ProdLDA / NVDM | Srivastava and Sutton, 2017; Miao et al. 2016 [23][24] | Bag-of-words inputs; neural autoencoder | Latent Dirichlet-like distribution learned via amortized VI | Often higher coherence; trains by gradient descent | Less stable; harder to interpret variational parameters |
| Embedded Topic Model (ETM) | Dieng, Ruiz, Blei, 2020 [25] | Word embeddings + latent topic embeddings | Topics live in the same space as words | Handles rare words well; combines topic and embedding interpretations | More moving parts |
| Top2Vec | Angelov, 2020 [26] | doc2vec or transformer embeddings | Cluster centroids in embedding space | No need to specify K; very interpretable nearest-word topics | Topics are not a probability distribution; comparison to LDA on the same metrics is awkward |
| BERTopic | Grootendorst, 2022 [27] | Transformer (e.g. SBERT) embeddings, UMAP, HDBSCAN, c-TF-IDF | Cluster-derived topics with TF-IDF style word lists | Excellent topic coherence, dynamic topic and class-based extensions, active ecosystem | Pipeline complexity, harder to do principled Bayesian inference, sensitive to embedding choice |
Two points are worth keeping in mind. First, BERTopic and Top2Vec are not really probabilistic topic models in the LDA sense: they are clustering algorithms in embedding space dressed up with a per-cluster word list. They generally do not give you a per-document distribution over topics; each document tends to get a single hard cluster assignment, though BERTopic offers soft assignment via cosine similarity on the centroids. If your downstream analysis depends on a clean theta vector for each document (regression on topic prevalence, propensity-style matching, time series of topic prevalence), LDA-style models are still the more honest tool.
Second, for languages other than English, for short-text social media, and for highly specialized vocabularies, the relative performance of LDA and neural topic methods is more mixed than headline benchmarks suggest.
The model has several well-known weaknesses, most of which are direct consequences of its assumptions.
LDA stopped being the default approach to text analysis somewhere around the time BERT came out in 2018, and the gap widened as sentence-level transformer encoders matured. Most new applied research that wants a general semantic representation now reaches for a large language model or a sentence embedder rather than LDA. There are still good reasons to prefer LDA in 2026, though, and they are not nostalgic.
The per-document topic vector is the underrated feature. It is a small, interpretable mixture summary of every document, with each entry tied to a named distribution over the vocabulary. That makes it a natural input to downstream regressions, clustering, time-series analysis, and propensity-score matching. LLM embeddings, even when they are higher quality, are dense vectors with no obvious interpretation and no guarantee of stability across model versions.
LDA is also fully reproducible without GPU access, which matters for archival work, low-resource settings, and any analysis that has to run for a long time on the same hardware. And the math is transparent. A graduate student can derive the variational updates or the Gibbs sampler from scratch in an afternoon, then run the model on a laptop. Few neural alternatives clear that bar.
For short, single-paragraph corpora and for topic discovery on very small datasets, LDA tends to perform poorly because the bag-of-words assumption strips away too much signal. There it is fair to reach for an embedding-based method. For long-form documents in any field where the topic structure itself is the object of study, the model still earns its keep.
It is also worth noting that the methodological lineage running from LDA into modern generative AI is direct. The variational EM machinery developed for LDA generalized into stochastic variational inference, then into amortized variational inference and the variational autoencoder (Kingma and Welling, 2014). Many of the same researchers (Blei, Hoffman, Mohamed, Welling, Kingma) were involved in both lines. LDA was the proving ground for an entire family of probabilistic methods that the field still uses, even when the headline applications have moved on.
The paper has been recognized as a landmark in machine learning. David Blei received the ACM Doctoral Dissertation Award in 2004 for the dissertation that introduced LDA, was elected an ACM Fellow in 2017 with the citation specifically referencing topic-model contributions, and has been elected to the American Academy of Arts and Sciences (2022) and as a Fellow of the Royal Statistical Society. Andrew Ng and Michael Jordan are also widely recognized for adjacent work; Jordan, in particular, is a National Academy of Sciences member whose Bayesian and graphical-model contributions LDA built on. The Blei, Ng, Jordan paper itself is consistently among the top-cited papers in JMLR's history.