Topic model

A topic model is a statistical model for discovering the abstract "topics" that occur in a collection of documents. Each document is modelled as a mixture over a small number of latent topics, and each topic is modelled as a probability distribution over words in the vocabulary. Topic models are a form of unsupervised learning: the topics are not specified in advance and no training labels are provided. They emerge from patterns of word co-occurrence across the corpus, and a human reader inspects the high-probability words of each topic to assign an interpretive label such as "genetics", "foreign policy", or "deep learning".

The modern field begins with three papers that are now standard reading: the singular value decomposition approach of Deerwester, Dumais, Furnas, Landauer, and Harshman in Indexing by Latent Semantic Analysis (1990), the probabilistic mixture model of Hofmann in Probabilistic Latent Semantic Indexing (1999), and the fully Bayesian generative model of Blei, Ng, and Jordan in Latent Dirichlet Allocation (2003). LDA in particular has been cited tens of thousands of times and remains the canonical example of a probabilistic topic model. A more recent line of work (BERTopic, Top2Vec, the Embedded Topic Model) replaces the bag-of-words representation with neural embeddings while keeping the same basic goal of producing interpretable topic descriptions.

Definition and intuition

Given a corpus of D documents drawn from a vocabulary of V words, a topic model assumes a small number K of latent topics. Each topic k is described by a distribution φ_k over the V words, so a topic about machine learning might place high probability on "network", "training", "model", "learning", and "neural". Each document d is described by a distribution θ_d over the K topics, so an article about a new convolutional architecture might be 70 percent "machine learning", 20 percent "computer vision", and 10 percent "hardware". The observed words of the document are generated by repeatedly drawing a topic from θ_d and then drawing a word from the chosen topic's distribution φ_k.
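The two-step draw described above implies a simple expression for the probability of seeing a given word at any position in a document. Writing θ_{d,k} for the weight of topic k in document d and φ_{k,w} for the probability of word w under topic k (the subscripted notation is introduced here for illustration), the word probability is a weighted average over topics:

$$
p(w \mid d) = \sum_{k=1}^{K} \theta_{d,k}\, \phi_{k,w}
$$

As a worked example with made-up numbers: if the article above has the 0.7 / 0.2 / 0.1 mixture and the word "network" has probability 0.05, 0.01, and 0.002 under the three topics respectively, then p("network" | d) = 0.7 × 0.05 + 0.2 × 0.01 + 0.1 × 0.002 = 0.0372.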

Topic models are deliberately simple in two ways. First, they treat documents as a bag of words, throwing away word order and grammar. Second, they assume a fixed vocabulary and a fixed number of topics. These simplifications make inference tractable on millions of documents while still recovering coherent themes, which is why LDA-style models remain in production use after twenty years even though deep learning has otherwise eaten the surrounding field of natural language processing.

Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA), introduced by David Blei, Andrew Ng, and Michael Jordan in the Journal of Machine Learning Research in 2003, is the canonical topic model and the starting point for almost every later variant. LDA is a generative model: it specifies a probabilistic story for how each word in each document was produced.

The generative process for a corpus of D documents and K topics is as follows. For each topic k in 1..K, draw a word distribution φ_k from a Dirichlet prior with concentration parameter β. For each document d in 1..D, draw a topic distribution θ_d from a Dirichlet prior with concentration parameter α. Then for each of the N_d word positions in document d, draw a topic assignment z_n from a categorical distribution with parameters θ_d, and draw the observed word w_n from the categorical distribution φ_{z_n}. The only observed quantities are the words w_n; the topic assignments z, the per-document topic distributions θ, and the per-topic word distributions φ are all latent and must be inferred.
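The generative story above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, the Dirichlet draws use the normalised-Gamma construction, and every document is given the same length for simplicity (the model itself treats each N_d as given).

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]

def generate_corpus(num_docs, num_topics, vocab, doc_length, alpha, beta, seed=0):
    """Simulate the LDA generative process; returns (phi, theta, docs)."""
    rng = random.Random(seed)
    # One word distribution phi_k per topic, drawn from Dirichlet(beta).
    phi = [sample_dirichlet(beta, len(vocab), rng) for _ in range(num_topics)]
    theta, docs = [], []
    for _ in range(num_docs):
        # One topic distribution theta_d per document, drawn from Dirichlet(alpha).
        theta_d = sample_dirichlet(alpha, num_topics, rng)
        words = []
        for _ in range(doc_length):
            # Draw a topic assignment z_n, then a word w_n from that topic.
            z = rng.choices(range(num_topics), weights=theta_d)[0]
            w = rng.choices(vocab, weights=phi[z])[0]
            words.append(w)
        theta.append(theta_d)
        docs.append(words)
    return phi, theta, docs

vocab = ["gene", "dna", "treaty", "embassy", "neural", "gradient"]
phi, theta, docs = generate_corpus(
    num_docs=3, num_topics=2, vocab=vocab, doc_length=8, alpha=0.5, beta=0.5
)
```

Inference runs this story in reverse: given only `docs`, it tries to recover plausible values of `phi` and `theta`.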

The Dirichlet prior is the workhorse of the model. It is the conjugate prior of the categorical distribution, which makes posterior updates tractable, and its concentration parameters control sparsity. A small α (less than one) pushes each document toward only a few topics, while a large α spreads probability across many topics. The same logic applies to β for the topic-word distribution. Posterior inference produces an estimate of the joint distribution over θ, φ, and z given the observed words.
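The effect of the concentration parameter on sparsity is easy to check by simulation. The sketch below, using hypothetical helper names and illustrative settings, draws many samples from a symmetric Dirichlet over ten topics and reports the average share taken by the largest component: a value near one means a single topic dominates, a value near 1/K means the mass is spread evenly.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw from a symmetric Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [x / total for x in draws]

def mean_largest_share(alpha, k=10, trials=2000, seed=0):
    """Average weight of the single largest component across many draws."""
    rng = random.Random(seed)
    total = sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(trials))
    return total / trials

sparse = mean_largest_share(0.1)    # small alpha: a few topics dominate
diffuse = mean_largest_share(10.0)  # large alpha: close to uniform
print(f"alpha=0.1  -> largest share {sparse:.2f}")
print(f"alpha=10.0 -> largest share {diffuse:.2f}")
```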

History and timeline

The development of topic models stretches across statistics, information retrieval, and machine learning, with each step adding a structural assumption or relaxing one. The Bayesian phase that LDA opened in 2003 produced more than a decade of variants, and the embedding phase that started around 2018 is still in progress.

Year	Model	Authors	Contribution
1990	LSA / LSI	Deerwester, Dumais, Furnas, Landauer, Harshman	SVD of TF-IDF matrix; not probabilistic but the conceptual precursor
1999	pLSA / pLSI	Hofmann	Probabilistic mixture model trained with EM
2003	LDA	Blei, Ng, Jordan	Fully Bayesian generative model with Dirichlet priors
2004	Author-Topic Model	Rosen-Zvi, Griffiths, Steyvers, Smyth	Each author has a topic distribution; document is a mixture of authors
2004	Collapsed Gibbs sampling for LDA	Griffiths, Steyvers	Practical MCMC inference; Finding scientific topics in PNAS
2006	Dynamic Topic Models	Blei, Lafferty	Topics evolve over time via state-space model
2006	Pachinko Allocation (PAM)	Li, McCallum	DAG-structured correlations among topics
2006	Hierarchical Dirichlet Process	Teh, Jordan, Beal, Blei	Nonparametric prior; K inferred automatically
2007	Correlated Topic Model	Blei, Lafferty	Logistic-normal prior allows correlated topics
2008	Supervised LDA (sLDA)	Blei, McAuliffe	Document-level response variable guides topic discovery
2010	Online LDA	Hoffman, Blei, Bach	Online variational Bayes on mini-batches; scales to streaming corpora
2013	Stochastic Variational Inference	Hoffman, Blei, Wang, Paisley	General SVI framework for conjugate exponential families
2020	Embedded Topic Model (ETM)	Dieng, Ruiz, Blei	Combines LDA with word embeddings
2020	Top2Vec	Angelov	Joint embedding of words, documents, and topics
2022	BERTopic	Grootendorst	Transformer embeddings clustered with UMAP and HDBSCAN

Inference techniques

The LDA posterior is intractable in closed form because the topic assignments z and the continuous variables θ and φ are entangled, so practical implementations rely on approximate inference. The two dominant families are variational methods, which fit a tractable distribution to the true posterior, and sampling methods, which draw from it.

Variational EM was the original inference algorithm proposed by Blei, Ng, and Jordan in the 2003 paper. It introduces a fully factorised variational distribution over θ, φ, and z and minimises the KL divergence to the true posterior by coordinate ascent, using an EM-like outer loop to update the corpus-level parameters.
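The link between minimising the KL divergence and the quantity that coordinate ascent actually climbs can be made explicit. For any variational distribution q over the latent variables, the log marginal likelihood of the words decomposes as follows (a standard identity, written here with the symbols used above):

$$
\log p(w \mid \alpha, \beta)
= \underbrace{\mathbb{E}_{q}\big[\log p(\theta, \phi, z, w \mid \alpha, \beta)\big] - \mathbb{E}_{q}\big[\log q(\theta, \phi, z)\big]}_{\mathcal{L}(q)}
+ \mathrm{KL}\big(q(\theta, \phi, z) \,\|\, p(\theta, \phi, z \mid w, \alpha, \beta)\big)
$$

The left-hand side does not depend on q and the KL term is non-negative, so raising the lower bound \mathcal{L}(q) one factor at a time is equivalent to shrinking the KL divergence to the true posterior.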

Collapsed Gibbs sampling, described by Thomas Griffiths and Mark Steyvers in their 2004 PNAS paper Finding scientific topics, is the most widely used inference technique in practice. The trick is to integrate out θ and φ analytically using the conjugacy of the Dirichlet and the categorical, leaving a sampler that operates only on the discrete topic assignments z. The conditional probability that word n in document d is assigned to topic k depends on the current count of how often word w_n has been assigned to topic k across the corpus and how often topic k has been used in document d. Each iteration sweeps through every word and resamples its topic. The MALLET implementation built by Andrew McCallum and collaborators is widely regarded as the gold-standard collapsed Gibbs sampler.
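The sampler described above fits in a short pure-Python function. In this sketch the conditional for assigning a word to topic k is taken proportional to (n_{d,k} + α) · (n_{k,w} + β) / (n_k + Vβ), where the counts exclude the word currently being resampled. The function name, the default hyperparameters, and the toy corpus are illustrative assumptions, and the code favours clarity over speed.

```python
import random

def collapsed_gibbs_lda(docs, num_topics, vocab_size,
                        alpha=0.1, beta=0.01, iterations=50, seed=0):
    """docs: list of documents, each a list of word ids in [0, vocab_size)."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # n_{d,k}
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n_{k,w}
    topic_total = [0] * num_topics                              # n_k
    assignments = []

    # Start from uniformly random topic assignments.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Remove the current word from all counts.
                k_old = assignments[d][n]
                doc_topic[d][k_old] -= 1
                topic_word[k_old][w] -= 1
                topic_total[k_old] -= 1
                # Unnormalised conditional over topics for this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                k_new = rng.choices(range(num_topics), weights=weights)[0]
                # Add the word back under its newly sampled topic.
                assignments[d][n] = k_new
                doc_topic[d][k_new] += 1
                topic_word[k_new][w] += 1
                topic_total[k_new] += 1
    return assignments, doc_topic, topic_word

# Toy corpus: word ids 0-1 co-occur in some documents, 2-3 in others.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1, 0], [3, 2, 2, 3]]
assignments, doc_topic, topic_word = collapsed_gibbs_lda(
    docs, num_topics=2, vocab_size=4
)
```

Point estimates of θ and φ can then be read off the count tables by adding the corresponding prior (α or β) to each count and normalising each row.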

Online variational Bayes, introduced by Matthew Hoffman, David Blei, and Francis Bach in their 2010 NeurIPS paper Online Learning for Latent Dirichlet Allocation, processes mini-batches of documents and applies a stochastic gradient step on the variational parameters. The 2010 paper showed convergence on 3.3 million Wikipedia articles in a single pass, where batch variational inference would have taken days for one iteration. This algorithm later generalised into stochastic variational inference (Hoffman, Blei, Wang, and Paisley, JMLR 2013), which extends the same trick to a wide class of conjugate exponential family models.
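The core of the online update is a decaying step size that blends the current topic-word parameters with an estimate computed from the latest mini-batch. The sketch below shows only that blending step. The schedule ρ_t = (τ0 + t)^(−κ), with κ in (0.5, 1], is the form usually attributed to the 2010 paper; the function names and default values are illustrative assumptions, and the computation of the mini-batch estimate itself is omitted.

```python
def step_size(t, tau0=64.0, kappa=0.7):
    """Decaying weight given to mini-batch t; kappa in (0.5, 1] for convergence."""
    return (tau0 + t) ** (-kappa)

def blend(current, batch_estimate, rho):
    """Move each topic-word parameter a fraction rho toward the batch estimate."""
    return [
        [(1.0 - rho) * c + rho * b for c, b in zip(c_row, b_row)]
        for c_row, b_row in zip(current, batch_estimate)
    ]

# One topic over a two-word vocabulary, updated by two mini-batches.
params = [[1.0, 3.0]]
for t, estimate in enumerate([[[3.0, 1.0]], [[2.0, 2.0]]]):
    params = blend(params, estimate, step_size(t))
```

Early mini-batches move the parameters substantially, while later ones make progressively smaller corrections, which is what allows a single pass over a very large corpus to settle.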

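For users who want full Bayesian uncertainty rather than point estimates, probabilistic programming languages like Stan and PyMC can run Hamiltonian Monte Carlo or NUTS samplers on the LDA posterior, typically after the discrete topic assignments z have been summed out, because these gradient-based samplers operate on continuous parameters. This route is far slower than collapsed Gibbs sampling.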

Hyperparameters

Three numbers govern the behaviour of an LDA model and must be chosen by the practitioner. The number of topics K is the most consequential. A model with K equal to ten will produce broad, sometimes vague topics, while K equal to several hundred will yield narrow, sometimes overlapping topics. The Hierarchical Dirichlet Process (Teh, Jordan, Beal, and Blei, JASA 2006) sidesteps this by placing a nonparametric prior that lets K grow with the corpus, but in practice many users still pick K by hand or by sweeping over candidate values and inspecting topic coherence.

The document-topic concentration α controls how sparse each document's topic distribution is. Common defaults are 50/K and 1/K; values below one push each document to use only a handful of topics. The topic-word concentration β has the same effect on the per-topic word distribution: small β yields topics dominated by a few characteristic words, while large β yields more diffuse topics. Some implementations, including MALLET and Hoffman's online LDA, learn α and β automatically rather than fixing them.

Output and interpretation

A fitted topic model produces two main artefacts. The first is the topic-word matrix φ, where each row is a probability distribution over the vocabulary. The conventional way to inspect a topic is to print its top ten or twenty words. A topic with top words neural, network, learning, deep, training, layer, gradient, model, weights, loss is recognisably about deep learning even though the model has no idea what the phrase "deep learning" means. The second is the document-topic matrix θ, which gives each document's mixture proportions. These are the features most often used in downstream pipelines, for clustering, retrieval, or as inputs to a classifier.
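Both artefacts are plain numeric arrays, so the usual inspection steps take only a few lines. The sketch below uses hypothetical helper names and a tiny hand-written φ row and θ row to show how the top words of a topic and the dominant topic of a document might be extracted.

```python
def top_words(phi_row, vocab, n=10):
    """Return the n highest-probability words for one row of the topic-word matrix."""
    ranked = sorted(range(len(vocab)), key=lambda i: phi_row[i], reverse=True)
    return [vocab[i] for i in ranked[:n]]

def dominant_topic(theta_row):
    """Return the index of the topic with the largest share in one document."""
    return max(range(len(theta_row)), key=lambda k: theta_row[k])

vocab = ["neural", "treaty", "network", "embassy", "gradient"]
phi_row = [0.40, 0.02, 0.30, 0.03, 0.25]   # illustrative values only
theta_row = [0.7, 0.2, 0.1]                # illustrative values only

print(top_words(phi_row, vocab, n=3))
print(dominant_topic(theta_row))
```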

Topic labels are assigned by humans. The model does not name its topics, and a typical analysis project includes a reading session where domain experts look at the top words of each topic and write a short label. Some topics are clear; others are uninterpretable junk. Tools like LDAvis (Sievert and Shirley, 2014) help by showing the marginal topic distribution and the distinctiveness of each topic's vocabulary.

Topic coherence is the standard quantitative metric for topic quality. Newman et al. (2010) proposed scoring the top words of a topic by their pointwise mutual information (PMI) in an external corpus such as Wikipedia; later work commonly uses the normalised variant, NPMI. David Mimno, Hanna Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum followed up in 2011 with Optimizing Semantic Coherence in Topic Models, which showed that an internal coherence metric computed only from the training corpus is a strong proxy for human judgement and can be used to identify and discard bad topics.
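An internal coherence score of the kind Mimno et al. describe needs only document co-occurrence counts from the training corpus. The sketch below implements the commonly cited form of that metric, often called UMass coherence: for top words ordered by probability, it sums log((D(v_m, v_l) + 1) / D(v_l)) over every pair in which v_l ranks above v_m, where D counts the documents containing the given word or words. The function name and toy corpus are illustrative, and skipping words that never occur is a simplification made here.

```python
import math

def umass_coherence(top_words, documents):
    """Internal coherence from document co-occurrence; higher (nearer 0) is better."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every one of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denom = doc_freq(top_words[l])
            if denom == 0:
                continue  # word absent from the corpus: skipped in this sketch
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / denom)
    return score

documents = [
    ["cat", "dog"],
    ["cat", "dog"],
    ["cat", "fish"],
    ["stock", "bond"],
]
coherent = umass_coherence(["cat", "dog"], documents)
incoherent = umass_coherence(["cat", "stock"], documents)
```

Words that tend to appear in the same documents score close to zero, while words that never co-occur are pushed strongly negative, which is what makes the score usable for flagging junk topics.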

Use cases

Topic models are popular in any setting where someone has a large pile of unlabelled text or other count data and wants a quick map of the dominant themes. They are also useful when interpretability matters more than raw predictive accuracy, since each document is described by a mixture of named topics rather than an opaque embedding.

Application area	Typical use
Document exploration	Mapping the themes in an unfamiliar corpus
Document clustering	Group documents by dominant topic
Information retrieval	Topic-based ranking, query expansion
Recommender systems	Match users to items via topic preferences
Trend analysis	Track how topics rise and fall over time using a Dynamic Topic Model
Author analysis	Profile authors by their topic mixtures (author-topic models)
Scientific literature	Griffiths and Steyvers analysed PNAS abstracts; Blei and Lafferty did the same for Science
News organisation	Cluster headlines, surface emerging stories
Customer reviews	Extract themes from product or restaurant reviews
Social media	Detect topics in tweets and forum posts
Genomics	Topics over gene-expression data approximate biological pathways
Computer vision	Bag-of-visual-words topic models for image categorisation
Music	Latent genre discovery from playlists and tags

Topic features (the θ vector for each document) are also commonly used as inputs to a downstream supervised classifier. This was a standard pre-deep-learning recipe for text classification: train an LDA model on the corpus, then train a logistic regression on the resulting topic vectors.

Comparison with related techniques

Topic modelling sits at the intersection of dimensionality reduction, mixture models, and matrix factorisation. The same basic data, a document-by-word count matrix, can be decomposed in many different ways, each with its own assumptions and trade-offs.

Technique	Probabilistic?	Handles new documents	Notes
LSA / LSI (1990)	No	Yes (project new doc)	SVD of TF-IDF; the earliest latent-semantic approach
pLSA (1999)	Yes (mixture only)	No (no doc-level prior)	Trained with EM; prone to overfitting because the parameter count grows with the number of documents
LDA (2003)	Yes (full Bayesian)	Yes	Dirichlet priors on doc-topic and topic-word
NMF	No	Yes (project)	Non-negative matrix factorisation; fast deterministic alternative
HDP (2006)	Yes (nonparametric)	Yes	Like LDA but K is inferred automatically
CTM (2007)	Yes	Yes	Logistic-normal allows topic correlations
Top2Vec (2020)	No (embedding)	Yes	Joint embedding of words, docs, topics
BERTopic (2022)	No (embedding)	Yes	UMAP + HDBSCAN on transformer embeddings; uses class-based TF-IDF for topic words
Word2Vec / contextual embeddings	No	Yes	Word-level, not a doc-level mixture; not a topic model in the strict sense
LLM topic discovery	No	Yes	Prompt a large language model like GPT-4 to extract topics

LSA, the 1990 precursor, applies a singular value decomposition to a TF-IDF matrix and keeps the top k singular components. It is fast and deterministic but has no probabilistic interpretation, so quantities like "the probability of word w under topic k" do not make sense. pLSA fixed this by introducing a multinomial mixture but lacked a prior over the per-document topic distribution, which meant the model could not assign topic mixtures to documents it had not seen during training and tended to overfit. LDA closed both gaps with the Dirichlet priors and became the dominant model.

Non-negative matrix factorisation (NMF) decomposes the document-word matrix into two non-negative matrices. The result looks very similar to a topic model, with one matrix giving topic proportions per document and the other giving word weights per topic, and on many corpora NMF and LDA produce comparable topics. NMF is faster and simpler to implement, but in its usual least-squares form it has no probabilistic interpretation, and new documents can only be folded in by projecting them onto the learned topic matrix rather than by posterior inference.

Software libraries

Library	Language	Notes
Gensim	Python	LDA, online LDA, HDP, NMF, MALLET wrapper; the most popular Python topic modelling library
scikit-learn	Python	`LatentDirichletAllocation` (online VB) and `NMF`
MALLET	Java	Gold-standard collapsed Gibbs sampler from Andrew McCallum's group at UMass
BERTopic	Python	Maarten Grootendorst's transformer-based topic model
Top2Vec	Python	Dimo Angelov's joint document, word, topic embedding
tomotopy	C++ with Python bindings	Fast Gibbs sampling; supports many LDA variants
Stan / PyMC	Python, R	Full Bayesian inference for users who want posterior uncertainty
Mallet via R `mallet`	R	Bridge for R users to MALLET
`stm`	R	Structural Topic Model used in social science
ETM	Python	Reference implementation of Dieng et al. 2020 Embedded Topic Model

MALLET, written in the early 2000s and maintained by McCallum's group, is still cited as the most accurate practical implementation. Gensim, written by Radim Rehurek, is the most widely deployed because of its Python ergonomics, online training, and integration with the rest of the Python data stack.

Strengths

Topic models are unusually interpretable for a machine learning method. A trained model can be summarised by printing the top words of each topic, and a domain expert can usually agree or disagree with each cluster on inspection. They are unsupervised, so they work on raw text without annotation. They scale to corpora of millions of documents thanks to online VB and Gibbs sampling. The probabilistic formulation is theoretically principled and lends itself to extensions: dynamic topic models, supervised LDA, author-topic models, and the embedded topic model are all small modifications of the same core graphical model.

This interpretability is the main reason topic models survive in 2026 despite the dominance of transformer representations. A neural document embedding from a large language model gives better predictive performance on most downstream tasks, but it is hard to look at one and explain to a non-technical reader what the model has learned. Topic models give you a vocabulary you can argue about.

Weaknesses

The bag-of-words assumption discards word order and syntax, which is a real cost on tasks where phrasing matters. The need to choose K is a perpetual annoyance, partly addressed by HDP and partly by sweeping over coherence. Topic quality varies: even on a clean corpus a fitted LDA model often contains a handful of uninterpretable junk topics that mostly capture stop-word patterns or scanning artefacts. Hyperparameter sensitivity is real, and small changes to α, β, the vocabulary cutoff, or the random seed can produce noticeably different topics.

For very large corpora, batch inference is slow. Online LDA helps, but extreme-scale text analysis has largely moved to neural embedding methods. For downstream prediction tasks, fine-tuned BERT or T5 models almost always beat LDA features. Topic models are rarely competitive when raw accuracy is the only goal.

Modern relevance

Despite the rise of transformer language models, topic models remain useful in three settings. First, exploratory analysis of unfamiliar corpora, where interpretable themes matter more than predictive accuracy. A historian, a journalist, or a market researcher with a million documents and no labels still reaches for LDA or BERTopic. Second, social science and digital humanities, where the field has built methodology and norms around topic-model output, and where the interpretability of the topics is the analytical product. Third, content analysis pipelines at companies that need a stable, low-cost summary of incoming text such as customer reviews or support tickets.

The modern hybrid approach, embodied by BERTopic and Top2Vec, replaces the bag-of-words representation with dense document embeddings and replaces the Dirichlet generative model with density-based clustering on the embedding space. BERTopic uses sentence-transformer embeddings, reduces dimensionality with UMAP, clusters with HDBSCAN, and then describes each cluster with a class-based TF-IDF over the documents in that cluster. The result feels like a topic model to a user, complete with top-word lists per topic, but the underlying machinery is closer to clustering than to Bayesian inference. Top2Vec follows the same general recipe, embedding words and documents in a joint space and reading topics off the dense regions of documents.
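The class-based TF-IDF step can be sketched independently of the embedding and clustering stages. The version below assumes the weighting usually given for BERTopic, in which the score of term t in cluster c is tf(t, c) · log(1 + A / tf(t)), with tf(t) the term's count across all clusters and A the average number of tokens per cluster. It works on raw counts and hypothetical pre-clustered toy data, so it should be read as an illustration of the idea rather than as the library's implementation.

```python
import math
from collections import Counter

def class_tfidf(clusters):
    """clusters: dict mapping cluster id -> list of tokenised documents."""
    # Treat each cluster as one long document and count its terms.
    class_counts = {
        c: Counter(tok for doc in docs for tok in doc)
        for c, docs in clusters.items()
    }
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    avg_tokens = sum(total_counts.values()) / len(class_counts)
    return {
        c: {
            term: tf * math.log(1 + avg_tokens / total_counts[term])
            for term, tf in counts.items()
        }
        for c, counts in class_counts.items()
    }

clusters = {
    0: [["gpu", "model", "the"], ["model", "training", "the"]],
    1: [["goal", "match", "the"], ["match", "team", "the"]],
}
scores = class_tfidf(clusters)
top_term = {c: max(s, key=s.get) for c, s in scores.items()}
```

Terms that are frequent inside one cluster but rare elsewhere rise to the top, while terms spread evenly across clusters, such as "the" in the toy data, are down-weighted.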

In the broader machine learning curriculum, LDA is still taught as the canonical example of a hierarchical Bayesian model with latent variables and conjugate priors. The graphical model in plate notation appears in nearly every introduction to probabilistic machine learning.

Recent developments

From roughly 2023 onward, several lines of work have emerged that combine topic models with large language models. One is using an LLM as a coherence judge, replacing the older NPMI metric with prompts that ask the model to rate the interpretability of a top-word list. A related line uses LLMs to label topics automatically, turning a list of top words into a short human-readable description. A third uses LLMs as a topic discovery method in their own right, prompting the model to read a sample of documents and produce a topic list directly, sometimes followed by a topic assignment step on the rest of the corpus. These approaches sometimes outperform LDA on coherence and labelling but raise familiar issues of cost, reproducibility, and prompt sensitivity.

Classical topic models continue to dominate in social science and humanities, where the field values reproducibility, interpretability, and the ability to publish a model that other researchers can refit on their own corpora. Several survey papers from 2024 and 2025 argue for hybrid pipelines: use BERTopic or an LLM for initial topic discovery, then validate and refine with classical coherence metrics and human review.

References

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). "Indexing by Latent Semantic Analysis." *Journal of the American Society for Information Science*, 41(6): 391-407.
Hofmann, T. (1999). "Probabilistic Latent Semantic Indexing." In *Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval*, pp. 50-57.
Hofmann, T. (1999). "Probabilistic Latent Semantic Analysis." In *Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI)*.
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). "Latent Dirichlet Allocation." *Journal of Machine Learning Research*, 3: 993-1022.
Griffiths, T. L., and Steyvers, M. (2004). "Finding scientific topics." *Proceedings of the National Academy of Sciences*, 101(suppl. 1): 5228-5235.
Rosen-Zvi, M., Griffiths, T., Steyvers, M., and Smyth, P. (2004). "The Author-Topic Model for Authors and Documents." In *Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI)*, pp. 487-494.
Blei, D. M., and Lafferty, J. D. (2006). "Dynamic Topic Models." In *Proceedings of the 23rd International Conference on Machine Learning (ICML)*, pp. 113-120.
Li, W., and McCallum, A. (2006). "Pachinko Allocation: DAG-Structured Mixture Models of Topic Correlations." In *Proceedings of the 23rd International Conference on Machine Learning (ICML)*.
Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). "Hierarchical Dirichlet Processes." *Journal of the American Statistical Association*, 101(476): 1566-1581.
Blei, D. M., and Lafferty, J. D. (2007). "A Correlated Topic Model of *Science*." *Annals of Applied Statistics*, 1(1): 17-35.
Blei, D. M., and McAuliffe, J. D. (2008). "Supervised Topic Models." In *Advances in Neural Information Processing Systems 20 (NIPS 2007)*.
Hoffman, M. D., Blei, D. M., and Bach, F. (2010). "Online Learning for Latent Dirichlet Allocation." In *Advances in Neural Information Processing Systems 23 (NeurIPS)*.
Newman, D., Lau, J. H., Grieser, K., and Baldwin, T. (2010). "Automatic Evaluation of Topic Coherence." In *Proceedings of NAACL HLT 2010*.
Mimno, D., Wallach, H. M., Talley, E., Leenders, M., and McCallum, A. (2011). "Optimizing Semantic Coherence in Topic Models." In *Proceedings of EMNLP 2011*, pp. 262-272.
Blei, D. M. (2012). "Probabilistic Topic Models." *Communications of the ACM*, 55(4): 77-84.
Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). "Stochastic Variational Inference." *Journal of Machine Learning Research*, 14: 1303-1347.
Sievert, C., and Shirley, K. (2014). "LDAvis: A method for visualizing and interpreting topics." In *Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces*.
Dieng, A. B., Ruiz, F. J. R., and Blei, D. M. (2020). "Topic Modeling in Embedding Spaces." *Transactions of the Association for Computational Linguistics*, 8: 439-453.
Angelov, D. (2020). "Top2Vec: Distributed Representations of Topics." arXiv:2008.09470.
Grootendorst, M. (2022). "BERTopic: Neural topic modeling with a class-based TF-IDF procedure." arXiv:2203.05794.
Rehurek, R., and Sojka, P. (2010). "Software Framework for Topic Modelling with Large Corpora." In *Proceedings of the LREC Workshop on New Challenges for NLP Frameworks*. Gensim documentation: https://radimrehurek.com/gensim/
McCallum, A. K. (2002). "MALLET: A Machine Learning for Language Toolkit." http://mallet.cs.umass.edu/
