Latent Dirichlet allocation

Latent Dirichlet Allocation (LDA) is a generative probabilistic model for collections of discrete data, most often text. It treats each document as a mixture of a small number of latent topics, and each topic as a distribution over the vocabulary. Given a corpus, LDA infers both the topic mixture for every document and the word distribution for every topic without using any labels. The model was introduced by David Blei, Andrew Ng, and Michael Jordan in a paper presented at NIPS 2002 and published in the Journal of Machine Learning Research in January 2003 ^[1]. That paper became one of the most cited works in machine learning, with well over 50,000 citations on Google Scholar, and for roughly a decade LDA was the default tool for topic modeling on text corpora. The method is not specific to language. Anything that can be represented as a bag of discrete tokens, including images quantized into visual words, audio segments, log lines, or genetic sequences, can be modeled with LDA.

LDA is best understood as a Bayesian extension of probabilistic Latent Semantic Indexing (pLSI) due to Thomas Hofmann (1999) ^[2], which itself was a probabilistic reading of Latent Semantic Analysis by Deerwester and colleagues (1990) ^[3]. Where pLSI directly fits per-document topic proportions and offers no clean way to handle a new document, LDA places a Dirichlet prior over those proportions, turning the per-document mixture into a latent random variable. That single change makes the model fully generative, gives the parameter count a fixed ceiling that does not grow with the corpus, and lets the model assign topic distributions to documents it has never seen.

In the deep-learning era, dense neural representations have replaced LDA for many practical tasks. Word embeddings and contextual encoders from transformers capture semantics that bag-of-words models miss, and clustering pipelines like BERTopic and Top2Vec often produce more coherent topics on contemporary corpora. LDA is still in active use, though, especially in social science, legal, and humanities work where reproducibility, simple interpretability, and the per-document topic vector are exactly what the analyst wants. It also remains a standard teaching example for variational inference, Gibbs sampling, and Bayesian graphical models.

At a glance

Field	Value
Type	Generative probabilistic Bayesian model
Domain	Topic modeling on bag-of-words data
Authors	David Blei, Andrew Ng, Michael I. Jordan
Introduced	NIPS 2002; published JMLR vol. 3, January 2003
Citations (Google Scholar, 2024)	More than 50,000
Inference engines	Variational Bayes, collapsed Gibbs sampling, online VB, stochastic variational inference
Predecessor	Probabilistic LSI (Hofmann, 1999)
Independent precursor	Pritchard, Stephens, Donnelly model in genetics (2000)
Reference implementation	David Blei's lda-c
Standard libraries	Mallet, Gensim, scikit-learn, Stan, PyMC, Spark MLlib
Common modern replacements	BERTopic, Top2Vec, Embedded Topic Model

ELI5

Imagine the news. A typical article isn't only about one thing. A piece about a new climate bill is partly about politics, partly about energy, and partly about economics. LDA tries to discover the small set of underlying themes (call them topics) running through a whole pile of articles, without anyone telling it what those themes are.

It does this by pretending each article was written by a strange but methodical author. For every word, the author first rolls a weighted die to pick a topic, then rolls a different die (one belonging to that topic) to pick a word. Topics that come up more often in the article reflect what the article is mostly about. The model then runs that story backwards: given the words you actually see, what set of topics, and what mix per article, would best explain them? After enough passes, the words cluster into coherent groups (one bucket of words about politics, another about basketball, another about cooking) and each article gets a fingerprint showing how much of each topic it contains.

No grammar, no meaning, no sentence structure. Just word counts and probability. It works much better than it has any right to.

Background and predecessors

The lineage that LDA caps off goes back to information retrieval in the late 1980s. Latent Semantic Analysis (also called LSI in retrieval contexts), proposed by Deerwester, Dumais, Furnas, Landauer, and Harshman in 1990, builds a term-document matrix and applies truncated singular value decomposition to project documents into a lower-dimensional semantic space ^[3]. LSA captures co-occurrence structure and gives a useful similarity measure between documents, but it is a linear-algebraic technique, not a probabilistic model. It has no notion of a topic as a probability distribution, and the projected coordinates can be negative, which makes them awkward to interpret.

Thomas Hofmann's pLSI (1999) reframed LSA as a probabilistic mixture model ^[2]. In pLSI, every word in a document is assumed to be drawn from one of K topic distributions, and each document has its own mixing weights over those topics. Maximum likelihood estimation via expectation-maximization then learns both the topics and the per-document weights. pLSI gave the field a clean probabilistic semantics for topic modeling, but Blei, Ng, and Jordan pointed out two structural weaknesses ^[1]:

The per-document weights are model parameters, one set per document, so the number of parameters grows linearly with the corpus. The model cannot generalize cleanly to a new document.
pLSI is not a proper generative model at the document level. It can generate words inside an existing document, but it has no mechanism for sampling a new document from scratch.

LDA fixes both problems by adding a Dirichlet prior over the per-document topic proportions and (in the smoothed version most people use) over the topic-word distributions as well. The Dirichlet is the conjugate prior of the categorical and multinomial distributions, which keeps the math tractable. The result is a fully generative, fully Bayesian alternative to pLSI with a fixed number of hyperparameters.

For full historical accuracy, the population-genetics community arrived at a structurally similar model three years before LDA. Pritchard, Stephens, and Donnelly (2000), in their paper introducing the STRUCTURE program, modeled each individual's genome as a mixture of K ancestral populations, with each population modeled as a vector of allele frequencies ^[31]. The fitted mixture proportions describe how much of an individual's genome comes from each ancestral population. Translated into the LDA vocabulary, individuals correspond to documents, ancestral populations to topics, alleles to words, and the inference target is identical: a per-document mixture over latent groups together with per-group distributions over the observation vocabulary. Blei et al. cited the Pritchard work in later expositions; the two lineages converged on the same Bayesian admixture model from different application domains.

The generative model

Let K be the number of topics, V the vocabulary size, D the number of documents, and N_d the length of document d. The smoothed LDA model has two Dirichlet hyperparameters, alpha (a K-vector) governing document-topic proportions, and beta or eta (a V-vector) governing topic-word distributions. The generative story, following the smoothed variant in the original paper with the notation used here, is ^[1]:

for each topic k = 1..K:
    phi_k ~ Dirichlet(beta)              # word distribution for topic k

for each document d = 1..D:
    theta_d ~ Dirichlet(alpha)           # topic distribution for document d
    for each word position n = 1..N_d:
        z_{d,n} ~ Categorical(theta_d)   # pick a topic
        w_{d,n} ~ Categorical(phi_{z_{d,n}})  # pick a word from that topic

In plate notation, three plates capture the structure: an outer plate of size K for topics (containing phi), an outer plate of size D for documents (containing theta), and an inner plate of size N_d for word positions inside each document (containing the latent topic z and the observed word w). Only the words w_{d,n} are observed; theta, phi, and the topic assignments z are all latent.
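The generative story can be run forward directly. The following is a minimal sketch with NumPy, assuming symmetric scalar priors and document lengths supplied by the caller; the function name `sample_lda_corpus` and the toy sizes are illustrative choices, not part of the original paper.

```python
import numpy as np

def sample_lda_corpus(K, V, doc_lengths, alpha, beta, seed=0):
    """Draw topics, document mixtures, topic assignments and words from smoothed LDA.

    alpha and beta are scalars here, i.e. symmetric Dirichlet priors (an assumption
    of this sketch); doc_lengths fixes N_d for each document.
    """
    rng = np.random.default_rng(seed)
    # phi[k] ~ Dirichlet(beta): one word distribution per topic
    phi = rng.dirichlet(np.full(V, beta), size=K)
    # theta[d] ~ Dirichlet(alpha): one topic mixture per document
    theta = rng.dirichlet(np.full(K, alpha), size=len(doc_lengths))
    docs, assignments = [], []
    for d, n_d in enumerate(doc_lengths):
        # z_{d,n} ~ Categorical(theta_d): a topic for every word position
        z = rng.choice(K, size=n_d, p=theta[d])
        # w_{d,n} ~ Categorical(phi_{z_{d,n}}): a word from the chosen topic
        w = np.array([rng.choice(V, p=phi[k]) for k in z], dtype=int)
        assignments.append(z)
        docs.append(w)
    return phi, theta, assignments, docs

phi, theta, z, docs = sample_lda_corpus(
    K=3, V=20, doc_lengths=[50, 30, 40], alpha=0.5, beta=0.1
)
print(theta.round(2))   # per-document topic mixtures
print(docs[0][:10])     # first ten word ids of document 0
```

Only `docs` would be visible to an analyst; inference has to recover `phi`, `theta`, and `z` from it.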

A few features of this construction are worth pulling out. The model treats words as exchangeable inside a document (a bag-of-words assumption), and treats documents as exchangeable inside the corpus. There is no syntax, no word order, no notion of sentence boundary. By de Finetti's theorem, the exchangeability assumption corresponds exactly to the latent-mixture form Blei et al. write down. The Dirichlet prior on theta has a clean interpretation: small alpha values (less than 1) push each document toward using only a few topics; large alpha values spread mass across many topics. The same logic applies to beta on the word side. In practice analysts often use symmetric Dirichlet priors with small alpha, since real documents tend to be about a small handful of things at a time.
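The effect of the concentration parameter is easy to check numerically. This sketch draws many topic mixtures from symmetric Dirichlet priors and reports the average share taken by the single largest topic; K = 10 and the two alpha values are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10  # number of topics (illustrative)

def mean_largest_share(alpha, n_draws=2000):
    """Average weight of the largest component across symmetric Dirichlet draws."""
    draws = rng.dirichlet(np.full(K, alpha), size=n_draws)
    return draws.max(axis=1).mean()

sparse_share = mean_largest_share(0.1)    # alpha < 1: a few topics dominate
diffuse_share = mean_largest_share(10.0)  # alpha > 1: mass spreads out

print(f"alpha = 0.1:  largest topic share is about {sparse_share:.2f}")
print(f"alpha = 10.0: largest topic share is about {diffuse_share:.2f}")
```

With the small alpha the largest topic takes most of each draw, while with the large alpha every topic stays close to the uniform share of 1/K.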

The joint distribution induced by the model factorizes as

p(phi, theta, z, w | alpha, beta)
  = prod_k p(phi_k | beta)
  * prod_d p(theta_d | alpha)
  * prod_d prod_n p(z_{d,n} | theta_d) p(w_{d,n} | phi_{z_{d,n}}).

The quantity practitioners want is the posterior p(phi, theta, z | w, alpha, beta). Computing it exactly requires marginalizing over the discrete topic assignments and the continuous theta and phi simultaneously. That marginal sum has K^N terms (where N is the total number of word tokens in the corpus) and is intractable for any realistic dataset ^[1].
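Evaluating the joint density at one fixed configuration is cheap, which is what makes the contrast with the posterior instructive. The sketch below computes the log of the factorization term by term for a tiny hand-written example; the helper names and the toy values are assumptions made for illustration.

```python
import math
import numpy as np

def log_dirichlet_pdf(x, conc):
    """Log density of Dirichlet(conc) at a point x on the simplex."""
    log_norm = math.lgamma(conc.sum()) - sum(math.lgamma(c) for c in conc)
    return log_norm + float(np.sum((conc - 1.0) * np.log(x)))

def log_joint(phi, theta, z, docs, alpha, beta):
    """log p(phi, theta, z, w | alpha, beta), following the factorization above."""
    total = sum(log_dirichlet_pdf(phi_k, beta) for phi_k in phi)          # p(phi_k | beta)
    total += sum(log_dirichlet_pdf(theta_d, alpha) for theta_d in theta)  # p(theta_d | alpha)
    for d, (z_d, w_d) in enumerate(zip(z, docs)):
        total += float(np.sum(np.log(theta[d, z_d])))   # p(z_{d,n} | theta_d)
        total += float(np.sum(np.log(phi[z_d, w_d])))   # p(w_{d,n} | phi_{z_{d,n}})
    return total

# Toy configuration: K = 2 topics, V = 3 words, D = 2 documents
phi = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
theta = np.array([[0.9, 0.1], [0.5, 0.5]])
docs = [np.array([0, 1, 2]), np.array([2, 0])]
z = [np.array([0, 0, 1]), np.array([1, 0])]
alpha = np.ones(2)
beta = np.ones(3)

value = log_joint(phi, theta, z, docs, alpha, beta)
n_tokens = sum(len(doc) for doc in docs)
n_assignments = 2 ** n_tokens  # K^N joint topic assignments to sum over
print(value, n_assignments)
```

Even this five-token corpus has 32 possible assignments of topics to tokens; the count grows exponentially with corpus size, which is why the exact posterior is out of reach.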

Inference

Because the posterior is intractable, every practical use of LDA depends on an approximate inference algorithm. Four families have dominated.

Variational Bayes

The original Blei, Ng, and Jordan paper introduced a mean-field variational inference algorithm ^[1]. The method posits a fully factorized variational distribution

q(theta, z | gamma, phi) = prod_d q(theta_d | gamma_d) prod_n q(z_{d,n} | phi_{d,n})

in which theta_d is approximated by a Dirichlet with free parameter gamma_d, and each topic assignment z_{d,n} is approximated by an independent categorical with free parameter phi_{d,n}. (The notation collides with the topic-word phi above; in the paper the per-word variational categorical is also called phi.) The algorithm then maximizes the evidence lower bound (ELBO) by coordinate ascent, yielding fixed-point updates that look like soft EM steps. Each pass over a document updates phi_{d,n} for each word and gamma_d for the document; corpus-level updates handle the topic-word distributions in the smoothed version of the model.

Variational EM gives a deterministic, reasonably fast algorithm that scales linearly in the corpus size and is easy to parallelize over documents. The price is a bias: the mean-field approximation underestimates posterior variance and tends to find peakier topic distributions than fully Bayesian methods would. For most practical purposes the bias is acceptable.
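The per-document part of the algorithm is short enough to sketch. The code below alternates the two fixed-point updates for a single document with the topic-word matrix held fixed, which corresponds to the E-step of the unsmoothed model. It assumes SciPy is available for the digamma function; the function name, the argument `topics` (used instead of phi to avoid the notation collision), and the toy inputs are illustrative.

```python
import numpy as np
from scipy.special import digamma

def e_step_document(word_ids, topics, alpha, n_iter=50, tol=1e-6):
    """Coordinate ascent on the variational parameters of one document.

    word_ids: int array of length N_d
    topics:   (K, V) matrix whose rows are word distributions, held fixed
    alpha:    length-K Dirichlet parameter
    Returns gamma (length K) and the per-word responsibilities (N_d, K).
    """
    K = topics.shape[0]
    N = len(word_ids)
    gamma = alpha + N / K                    # initialization suggested in Blei et al.
    resp = np.full((N, K), 1.0 / K)
    word_probs = topics[:, word_ids].T       # (N, K): p(w_n | topic k)
    for _ in range(n_iter):
        # phi_{n,k} is proportional to p(w_n | k) * exp(digamma(gamma_k));
        # the digamma(sum(gamma)) term cancels when each row is normalized.
        resp = word_probs * np.exp(digamma(gamma))
        resp /= resp.sum(axis=1, keepdims=True)
        new_gamma = alpha + resp.sum(axis=0)  # gamma_k = alpha_k + sum_n phi_{n,k}
        converged = np.max(np.abs(new_gamma - gamma)) < tol
        gamma = new_gamma
        if converged:
            break
    return gamma, resp

topics = np.array([[0.45, 0.45, 0.05, 0.05],
                   [0.05, 0.05, 0.45, 0.45]])
word_ids = np.array([0, 1, 0, 1, 2])
alpha = np.full(2, 0.1)

gamma, resp = e_step_document(word_ids, topics, alpha)
print(gamma / gamma.sum())  # approximate posterior mean of theta_d
```

Because every row of the responsibilities sums to one, gamma always sums to the total prior mass plus the document length, a handy sanity check when debugging an implementation.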

Collapsed Gibbs sampling

Griffiths and Steyvers, working independently, introduced a Gibbs sampling algorithm for LDA that became enormously popular after their 2004 PNAS paper "Finding Scientific Topics" ^[4]. Their key trick was to integrate out theta and phi analytically, exploiting Dirichlet-multinomial conjugacy, so that the only variables left to sample are the discrete topic assignments z_{d,n}. This is called collapsed Gibbs sampling.

The sampler iterates over every word token in the corpus and resamples its topic assignment from the conditional

p(z_{d,n} = k | z_{-(d,n)}, w) ∝ (n^{-(d,n)}_{d,k} + alpha_k)
                              * (n^{-(d,n)}_{k,w_{d,n}} + beta) / (n^{-(d,n)}_{k,.} + V*beta)

where n_{d,k} is the count of words in document d assigned to topic k, n_{k,v} is the count of times word v has been assigned to topic k corpus-wide, n_{k,.} is the total number of tokens assigned to topic k, beta is the scalar value of a symmetric prior on the topic-word distributions, and the superscript indicates that the count excludes the current token. After a burn-in period the samples can be averaged to estimate theta and phi. Each pass over the corpus is computationally similar to a pass of variational EM, but the algorithm is exceptionally simple to implement (the inner loop is essentially counts and a normalized vector) and it converges to the true posterior in the limit, which variational methods do not.
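The conditional above translates almost line for line into code. The following is a minimal sketch of a collapsed Gibbs sampler with symmetric scalar priors. As a simplification it returns point estimates from the final sweep only, where a careful implementation would average over several post-burn-in samples; the function name, defaults, and toy corpus are illustrative assumptions.

```python
import numpy as np

def collapsed_gibbs(docs, K, V, alpha=0.1, beta=0.01, n_sweeps=200, seed=0):
    """Collapsed Gibbs sampling for LDA. docs is a list of int arrays of word ids."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))  # tokens in document d assigned to topic k
    n_kv = np.zeros((K, V))          # times word v is assigned to topic k
    n_k = np.zeros(K)                # total tokens assigned to topic k
    z = [rng.integers(K, size=len(doc)) for doc in docs]  # random initialization
    for d, doc in enumerate(docs):
        for w, k in zip(doc, z[d]):
            n_dk[d, k] += 1
            n_kv[k, w] += 1
            n_k[k] += 1

    for _ in range(n_sweeps):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]
                # Remove the current token so the counts exclude it
                n_dk[d, k] -= 1
                n_kv[k, w] -= 1
                n_k[k] -= 1
                # Unnormalized conditional over the K topics
                p = (n_dk[d] + alpha) * (n_kv[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                # Record the new assignment and restore the counts
                z[d][n] = k
                n_dk[d, k] += 1
                n_kv[k, w] += 1
                n_k[k] += 1

    theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
    phi = (n_kv + beta) / (n_kv + beta).sum(axis=1, keepdims=True)
    return theta, phi, z

# Toy corpus: words 0-2 co-occur in two documents, words 3-5 in the other two
docs = [np.array([0, 1, 2, 0, 1, 2]), np.array([0, 0, 1, 2, 1]),
        np.array([3, 4, 5, 3, 4]), np.array([5, 4, 3, 5, 3, 4])]
theta, phi, z = collapsed_gibbs(docs, K=2, V=6, n_sweeps=100)
print(theta.round(2))
```

The triple loop is slow in pure Python; production samplers such as Mallet use sparse data structures and compiled code for the same update.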

Collapsed Gibbs sampling is the algorithm behind Mallet, the topic-modeling toolkit from the University of Massachusetts Amherst, and remains the inference engine in many social-science applications because of its simplicity and well-understood behavior.

Online variational Bayes

For very large corpora, Hoffman, Blei, and Bach introduced an online variational Bayes algorithm at NeurIPS 2010 ^[5]. The trick is to apply stochastic optimization to the ELBO: each step samples a mini-batch of documents (often a single document), runs the per-document updates to convergence, and then takes a noisy gradient step on the topic-word parameters. The resulting algorithm streams through a corpus in a single pass while still optimizing the variational objective in expectation.

Online LDA is the algorithm behind Gensim's LdaModel and is what makes LDA practical for billion-token corpora and streaming text. Hoffman and colleagues later generalized the approach into stochastic variational inference (SVI), published in JMLR in 2013, which extends the same ideas to a broad class of conjugate-exponential-family models ^[6].

Other inference methods

A range of other approximate inference techniques have been applied to LDA, including expectation propagation, particle filters, Riemann manifold Hamiltonian Monte Carlo, and stochastic gradient Langevin dynamics. None has displaced collapsed Gibbs or variational Bayes as the workhorse, but they appear in research papers and specialized libraries.

The table below sketches the trade-offs.

Algorithm	Original reference	Strengths	Weaknesses
Mean-field variational Bayes	Blei, Ng, Jordan, JMLR 2003 ^[1]	Fast, deterministic, easy to parallelize	Biased toward peaky modes; requires algebraic derivation per model variant
Collapsed Gibbs sampling	Griffiths and Steyvers, PNAS 2004 ^[4]	Simple to implement; samples from the true posterior in the limit	Can be slow to mix; serial inner loop; harder to scale
Collapsed variational Bayes	Teh, Newman, Welling, NeurIPS 2007 ^[7]	Combines benefits of collapsing and variational methods	More complex derivation
Online VB	Hoffman, Blei, Bach, NeurIPS 2010 ^[5]	Streams over very large corpora; single pass possible	Sensitive to learning-rate schedule
Stochastic variational inference	Hoffman, Blei, Wang, Paisley, JMLR 2013 ^[6]	Generalizes online VB; applies broadly	Same as online VB

Hyperparameters and choosing K

The model's notation, as used in the Blei, Ng, Jordan paper and most follow-ups, is summarized below.

Symbol	Meaning	Type	Notes
K	Number of topics	Positive integer hyperparameter	Set by the user; not learned in vanilla LDA
V	Vocabulary size	Positive integer	Determined by preprocessing
D	Number of documents	Positive integer	Corpus size
N_d	Length of document d	Positive integer	Bag of words count
alpha	Dirichlet prior on document-topic proportions theta	K-vector of positive reals	Symmetric or asymmetric
beta (eta)	Dirichlet prior on topic-word distributions phi	V-vector of positive reals	Usually symmetric
theta_d	Topic distribution for document d	K-vector on the simplex	Latent
phi_k	Word distribution for topic k	V-vector on the simplex	Latent
z_{d,n}	Topic assignment for word position n in document d	Categorical in 1..K	Latent
w_{d,n}	Observed word at position n in document d	Categorical in 1..V	Observed

LDA has three knobs that the user picks: K (number of topics), alpha (Dirichlet on theta), and beta (Dirichlet on phi). The Blei et al. paper used a symmetric Dirichlet for both priors. Wallach, Mimno, and McCallum showed in 2009 that an asymmetric prior on theta and a symmetric prior on phi gives better held-out likelihoods on most corpora; that combination is now the default in Mallet and many other toolkits ^[8].

Griffiths and Steyvers recommended alpha = 50/K and beta = 0.1 as reasonable defaults ^[4]. Both alpha and the topic count K are typically chosen by some combination of held-out perplexity, topic coherence (the C_v measure of Roder, Both, and Hinneburg, 2015 ^[9], is the de facto standard), and direct human inspection. Perplexity often disagrees with human judgments. Chang and colleagues, in their 2009 paper "Reading Tea Leaves," introduced the word and topic intrusion tasks and showed that lower-perplexity models can in fact produce less interpretable topics ^[10]. Most practitioners now use coherence plus eyeballing rather than perplexity alone.

On K specifically, there is no automatic answer. Common practice is to fit several values, plot coherence and perplexity against K, and pick a knee. If you genuinely do not want to specify K, the hierarchical Dirichlet process (HDP) by Teh, Jordan, Beal, and Blei (JASA 2006) is the nonparametric extension that lets the data choose the number of topics ^[11].
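Perplexity, one of the quantities plotted against K, is the exponentiated negative average log-likelihood per token. The sketch below computes it from point estimates of theta and phi. For genuinely held-out documents theta would first have to be inferred with the topics fixed, a step this example assumes has already been done; the function name and toy inputs are illustrative.

```python
import numpy as np

def perplexity(docs, theta, phi):
    """exp(-average per-token log-likelihood) under point estimates theta and phi."""
    log_lik = 0.0
    n_tokens = 0
    for d, doc in enumerate(docs):
        # p(w | d) = sum_k theta[d, k] * phi[k, w] for every token in the document
        token_probs = theta[d] @ phi[:, doc]
        log_lik += np.log(token_probs).sum()
        n_tokens += len(doc)
    return float(np.exp(-log_lik / n_tokens))

# Sanity check: a model that is uniform over V = 4 words
docs = [np.array([0, 1, 2, 3])]
theta = np.array([[0.5, 0.5]])
phi = np.full((2, 4), 0.25)
uniform_ppl = perplexity(docs, theta, phi)
print(uniform_ppl)
```

A model that spreads probability uniformly over V words has perplexity V, so useful models should score well below the vocabulary size. Lower is better, with the caveat from Chang and colleagues that lower perplexity does not guarantee more interpretable topics.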

Extensions

A large family of LDA descendants relax one or another of the model's assumptions. A short tour:

Extension	Authors, year	What it relaxes
Correlated Topic Model (CTM)	Blei and Lafferty, 2007 ^[12]	Replaces the Dirichlet prior on theta with a logistic-normal so topics can be correlated.
Dynamic Topic Model (DTM)	Blei and Lafferty, 2006 ^[13]	Lets topics drift over time by chaining topic-word distributions through a state-space model.
Author-Topic Model	Rosen-Zvi, Griffiths, Steyvers, Smyth, 2004 ^[14]	Adds an author variable; each author has a Dirichlet over topics.
Pachinko Allocation	Li and McCallum, 2006 ^[15]	Models topic correlations through a directed acyclic graph rather than a single Dirichlet.
Supervised LDA (sLDA)	McAuliffe and Blei, 2008 ^[16]	Adds a per-document response variable predicted from the topic mixture.
Labeled LDA	Ramage, Hall, Nallapati, Manning, 2009 ^[17]	Constrains topics to align with given document labels.
Hierarchical LDA (hLDA)	Blei, Griffiths, Jordan, 2010 ^[18]	Topics live on an infinitely deep tree learned via the nested Chinese restaurant process.
HDP	Teh, Jordan, Beal, Blei, 2006 ^[11]	Nonparametric: number of topics is inferred.
Structural Topic Model (STM)	Roberts, Stewart, Tingley, et al., 2014 ^[19]	Lets document-level covariates influence topic prevalence and content.

Several of these (CTM, DTM, sLDA, STM) became standard tools in their own right, especially in computational social science.

Software

LDA was an early benchmark for many open-source machine-learning libraries. The implementations differ in inference algorithm, scale, and ergonomics.

Tool	Language	Inference	Notes
`lda-c`	C	Variational EM	David Blei's original implementation, accompanying the JMLR paper.
Mallet	Java	Collapsed Gibbs (with hyperparameter optimization)	UMass Amherst toolkit; very popular in digital humanities and social science.
Gensim	Python	Online VB	Radim Rehurek's library; default LDA tool in modern Python NLP.
`scikit-learn` `LatentDirichletAllocation`	Python	Variational Bayes (batch by default, online optional)	Implements the Hoffman, Blei, Bach algorithm; integrates with scikit-learn pipelines.
`lda` (R)	R	Collapsed Gibbs	Jonathan Chang's package; widely used in academic R workflows.
`stm` (R)	R	Variational EM	Implements the Structural Topic Model (Roberts et al.).
`Vowpal Wabbit`	C++	Online VB	Designed for very large corpora.
`LightLDA`, `WarpLDA`, `Familia`	C++	Distributed Gibbs / VB	Industrial-scale implementations from Microsoft, THU, and Baidu, capable of training on hundreds of millions of documents.
Stan	Stan / R / Python	Hamiltonian Monte Carlo, ADVI	General-purpose probabilistic programming; LDA written in tens of lines of Stan code.
PyMC	Python	NUTS, ADVI	Probabilistic programming alternative; useful for custom LDA variants.
Spark MLlib	Scala / Java / Python	EM and online VB	Distributed implementation in Apache Spark; targets large clusters.
TensorFlow Probability	Python	Variational and MCMC	LDA included as a tutorial example for tfp.distributions.

For most projects under, say, a few million documents, Gensim or Mallet is the right starting point. Mallet often produces noticeably better topics out of the box because of its asymmetric prior and its built-in hyperparameter optimization.

Applications

The original LDA paper used Reuters newswire and a corpus of TREC AP articles for evaluation ^[1]. Within a few years the model had been applied to almost any large discrete corpus people cared about.

Document clustering and exploratory text analysis. The single most common use. Run LDA, look at the topics, look at the documents loading on each topic, write the report.
Scientific literature analysis. Griffiths and Steyvers's 2004 PNAS paper applied LDA to PNAS abstracts from 1991 to 2001 and recovered topics like "genetic" or "surgery" with no supervision ^[4]. Topic modeling of arXiv, MEDLINE, and other archives became a small industry.
Computational social science. Topic models are a standard preprocessing step in studies of newspaper coverage, political speeches (the U.S. Congressional Record is a perennial favorite), survey free responses, and historical correspondence. Quinn et al.'s 2010 study of Senate speech is a representative example ^[20].
Digital humanities. Topic models of literary corpora, eighteenth-century pamphlets, and historical newspaper archives are common in the field, where the per-document topic vector is a useful summary for distant reading.
Image analysis. Fei-Fei and Perona's 2005 "Bayesian hierarchical model for natural scene categories" and follow-up work treat quantized SIFT features as visual words and apply LDA to images for unsupervised scene categorization ^[21].
Population genetics. The Pritchard, Stephens, Donnelly (2000) STRUCTURE model, mathematically equivalent to LDA on allele data, became the standard tool for inferring population structure from genotype data and remains in heavy use; the original paper has more than 30,000 citations ^[31].
Bioinformatics. Topic models have been applied to protein sequence motifs, genomic feature usage, and microbiome composition data.
Recommendation systems. Variants like collaborative topic regression (Wang and Blei, 2011 ^[22]) combine LDA over content with matrix factorization over user-item interactions.
Audio and speech. Topic models over acoustic words have been used for music genre identification and speaker analysis.
Industrial log analysis and customer review summarization. A pragmatic, low-glamour use that still shows up in production today.

Comparison to modern alternatives

For most general English text in 2024 to 2026, neural topic methods are at least competitive with LDA on standard coherence metrics, and often better on perceived topic quality. The shift mirrors the broader transformer revolution: dense contextual word embeddings capture similarity between words like "cat" and "feline" that bag-of-words LDA cannot.

Method	Reference	Representation	Topic structure	Strengths vs LDA	Weaknesses vs LDA
LDA	Blei, Ng, Jordan, 2003 ^[1]	Bag of words; word counts	Topics are categorical distributions over the vocabulary	Probabilistic, well-understood, interpretable, gives a per-document topic vector	Bag-of-words assumption; needs K specified; sensitive to preprocessing
ProdLDA / NVDM	Srivastava and Sutton, 2017; Miao et al. 2016 ^[23]^[24]	Bag-of-words inputs; neural autoencoder	Latent Dirichlet-like distribution learned via amortized VI	Often higher coherence; trains by gradient descent	Less stable; harder to interpret variational parameters
Embedded Topic Model (ETM)	Dieng, Ruiz, Blei, 2020 ^[25]	Word embeddings + latent topic embeddings	Topics live in the same space as words	Handles rare words well; combines topic and embedding interpretations	More moving parts
Top2Vec	Angelov, 2020 ^[26]	doc2vec or transformer embeddings	Cluster centroids in embedding space	No need to specify K; very interpretable nearest-word topics	Topics are not a probability distribution; comparison to LDA on the same metrics is awkward
BERTopic	Grootendorst, 2022 ^[27]	Transformer (e.g. SBERT) embeddings, UMAP, HDBSCAN, c-TF-IDF	Cluster-derived topics with TF-IDF style word lists	Excellent topic coherence, dynamic topic and class-based extensions, hot ecosystem	Pipeline complexity, harder to do principled Bayesian inference, sensitive to embedding choice

Two points are worth keeping in mind. First, BERTopic and Top2Vec are not really probabilistic topic models in the LDA sense: they are clustering algorithms in embedding space dressed up with a per-cluster word list. By default each document gets a single hard cluster assignment rather than a distribution over topics, though BERTopic can also report approximate per-document topic probabilities. If your downstream analysis depends on a clean theta vector for each document (regression on topic prevalence, propensity-style matching, time-series of topic prevalence), LDA-style models are still the more honest tool.

Second, for languages other than English, for short-text social media, and for highly specialized vocabularies, the relative performance of LDA and neural topic methods is more mixed than headline benchmarks suggest.

Limitations

The model has several well-known weaknesses, most of which are direct consequences of its assumptions.

Bag of words. Word order is discarded, so any phenomenon that depends on syntax (negation, modification, compound terms) is invisible to the model. Bigram and n-gram extensions exist but introduce sparsity.
Fixed K. Vanilla LDA requires the user to specify the number of topics. Picking K wrong gives either redundant or merged topics. Nonparametric extensions like the hierarchical Dirichlet process avoid this at the cost of a more complex inference algorithm.
Preprocessing sensitivity. Stop-word lists, lemmatization, minimum and maximum document frequency cutoffs, and rare-word thresholds all change the topics that LDA finds. The model is brittle to vocabulary choices in a way that neural embedders are not.
Hyperparameter sensitivity. Topics shift noticeably with alpha and beta. Mallet's asymmetric prior optimization helps; symmetric defaults often produce poorer topics than the asymmetric alternative recommended by Wallach, Mimno, and McCallum ^[8].
Short text. Tweets, search queries, and product titles violate the bag-of-words mixture assumption because each document has too few tokens to estimate a mixture. Dedicated short-text variants (biterm topic model, sparse topical coding) and embedding-based methods generally do better here.
Polysemy. LDA can give a word with several senses (for example bank as a financial institution and as a riverbank) high probability in more than one topic, but each occurrence is disambiguated only through the other words in the same document, never through its immediate context. Contextual embeddings handle this more naturally.
Local optima and run-to-run variability. Both variational EM and Gibbs sampling can land in different topic configurations on different runs. Best practice is to fit several seeds and pick the model with the highest likelihood or coherence.
No external knowledge. LDA does not know that "car" and "automobile" are related. Knowledge-informed extensions (seed words, anchor words, prior topic-word distributions) exist but are bolt-ons rather than first-class features.

Place in the LLM era

LDA stopped being the default approach to text analysis somewhere around the time BERT came out in 2018, and the gap widened as sentence-level transformer encoders matured. Most new applied research that wants a general semantic representation now reaches for a large language model or a sentence embedder rather than LDA. There are still good reasons to prefer LDA in 2026, though, and they are not nostalgic.

The per-document topic vector is the underrated feature. It is a small, interpretable mixture summary of every document, with each entry tied to a named distribution over the vocabulary. That makes it a natural input to downstream regressions, clustering, time-series analysis, and propensity-score matching. LLM embeddings, even when they are higher quality, are dense vectors with no obvious interpretation and no guarantee of stability across model versions.

LDA is also fully reproducible without GPU access, which matters for archival work, low-resource settings, and any analysis that has to run for a long time on the same hardware. And the math is transparent. A graduate student can derive the variational updates or the Gibbs sampler from scratch in an afternoon, then run the model on a laptop. Few neural alternatives clear that bar.

For short, single-paragraph corpora and for topic discovery on very small datasets, LDA tends to perform poorly because the bag-of-words assumption strips away too much signal. There it is fair to reach for an embedding-based method. For long-form documents in any field where the topic structure itself is the object of study, the model still earns its keep.

It is also worth noting that the methodological lineage running from LDA into modern generative AI is direct. The variational EM machinery developed for LDA generalized into stochastic variational inference, then into amortized variational inference and the variational autoencoder (Kingma and Welling, 2014). Many of the same researchers (Blei, Hoffman, Mohamed, Welling, Kingma) were involved in both lines. LDA was the proving ground for an entire family of probabilistic methods that the field still uses, even when the headline applications have moved on.

The paper has been recognized as a landmark in machine learning. David Blei received the ACM Doctoral Dissertation Award in 2004 for the dissertation that introduced LDA, was elected an ACM Fellow in 2017 with the citation specifically referencing topic-model contributions, and has been elected to the American Academy of Arts and Sciences (2022) and as a Fellow of the Royal Statistical Society. Andrew Ng and Michael Jordan are also widely recognized for adjacent work; Jordan, in particular, is a National Academy of Sciences member whose Bayesian and graphical-model contributions LDA built on. The Blei, Ng, Jordan paper itself is consistently among the top-cited papers in JMLR's history.

References

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). "Latent Dirichlet Allocation." *Journal of Machine Learning Research*, 3: 993 to 1022. (Earlier version at NIPS 2002.)
Hofmann, T. (1999). "Probabilistic Latent Semantic Indexing." *Proceedings of the 22nd ACM SIGIR Conference on Research and Development in Information Retrieval*, 50 to 57.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). "Indexing by Latent Semantic Analysis." *Journal of the American Society for Information Science*, 41(6): 391 to 407.
Griffiths, T. L. and Steyvers, M. (2004). "Finding Scientific Topics." *Proceedings of the National Academy of Sciences*, 101(suppl 1): 5228 to 5235.
Hoffman, M., Blei, D. M., and Bach, F. (2010). "Online Learning for Latent Dirichlet Allocation." *Advances in Neural Information Processing Systems* 23: 856 to 864.
Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). "Stochastic Variational Inference." *Journal of Machine Learning Research*, 14(1): 1303 to 1347.
Teh, Y. W., Newman, D., and Welling, M. (2007). "A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation." *Advances in Neural Information Processing Systems* 19: 1353 to 1360.
Wallach, H. M., Mimno, D., and McCallum, A. (2009). "Rethinking LDA: Why Priors Matter." *Advances in Neural Information Processing Systems* 22: 1973 to 1981.
Roder, M., Both, A., and Hinneburg, A. (2015). "Exploring the Space of Topic Coherence Measures." *Proceedings of the 8th ACM International Conference on Web Search and Data Mining*, 399 to 408.
Chang, J., Boyd-Graber, J., Gerrish, S., Wang, C., and Blei, D. M. (2009). "Reading Tea Leaves: How Humans Interpret Topic Models." *Advances in Neural Information Processing Systems* 22: 288 to 296.
Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). "Hierarchical Dirichlet Processes." *Journal of the American Statistical Association*, 101(476): 1566 to 1581.
Blei, D. M. and Lafferty, J. D. (2007). "A Correlated Topic Model of Science." *Annals of Applied Statistics*, 1(1): 17 to 35.
Blei, D. M. and Lafferty, J. D. (2006). "Dynamic Topic Models." *Proceedings of the 23rd International Conference on Machine Learning*, 113 to 120.
Rosen-Zvi, M., Griffiths, T., Steyvers, M., and Smyth, P. (2004). "The Author-Topic Model for Authors and Documents." *Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence*, 487 to 494.
Li, W. and McCallum, A. (2006). "Pachinko Allocation: DAG-Structured Mixture Models of Topic Correlations." *Proceedings of the 23rd International Conference on Machine Learning*, 577 to 584.
McAuliffe, J. D. and Blei, D. M. (2008). "Supervised Topic Models." *Advances in Neural Information Processing Systems* 20: 121 to 128.
Ramage, D., Hall, D., Nallapati, R., and Manning, C. D. (2009). "Labeled LDA: A Supervised Topic Model for Credit Attribution in Multi-Labeled Corpora." *Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing*, 248 to 256.
Blei, D. M., Griffiths, T. L., and Jordan, M. I. (2010). "The Nested Chinese Restaurant Process and Bayesian Nonparametric Inference of Topic Hierarchies." *Journal of the ACM*, 57(2): 7.
Roberts, M. E., Stewart, B. M., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S. K., Albertson, B., and Rand, D. G. (2014). "Structural Topic Models for Open-Ended Survey Responses." *American Journal of Political Science*, 58(4): 1064 to 1082.
Quinn, K. M., Monroe, B. L., Colaresi, M., Crespin, M. H., and Radev, D. R. (2010). "How to Analyze Political Attention with Minimal Assumptions and Costs." *American Journal of Political Science*, 54(1): 209 to 228.
Fei-Fei, L. and Perona, P. (2005). "A Bayesian Hierarchical Model for Learning Natural Scene Categories." *IEEE Conference on Computer Vision and Pattern Recognition*, 524 to 531.
Wang, C. and Blei, D. M. (2011). "Collaborative Topic Modeling for Recommending Scientific Articles." *Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, 448 to 456.
Srivastava, A. and Sutton, C. (2017). "Autoencoding Variational Inference for Topic Models." *International Conference on Learning Representations*.
Miao, Y., Yu, L., and Blunsom, P. (2016). "Neural Variational Inference for Text Processing." *International Conference on Machine Learning*, 1727 to 1736.
Dieng, A. B., Ruiz, F. J. R., and Blei, D. M. (2020). "Topic Modeling in Embedding Spaces." *Transactions of the Association for Computational Linguistics*, 8: 439 to 453.
Angelov, D. (2020). "Top2Vec: Distributed Representations of Topics." arXiv:2008.09470.
Grootendorst, M. (2022). "BERTopic: Neural Topic Modeling with a Class-Based TF-IDF Procedure." arXiv:2203.05794.
Murphy, K. P. (2012). *Machine Learning: A Probabilistic Perspective*. MIT Press, chapter 27 (Latent variable models for discrete data).
Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer, chapter 10 (Approximate Inference) and chapter 13 (Sequential Data).
Blei, D. M. (2012). "Probabilistic Topic Models." *Communications of the ACM*, 55(4): 77 to 84.
Pritchard, J. K., Stephens, M., and Donnelly, P. (2000). "Inference of Population Structure Using Multilocus Genotype Data." *Genetics*, 155(2): 945 to 959.
