Topic model
Last reviewed
May 1, 2026
Sources
22 citations
Review status
Source-backed
Revision
v1 · 3,952 words
A topic model is a statistical model for discovering the abstract "topics" that occur in a collection of documents. Each document is modelled as a mixture over a small number of latent topics, and each topic is modelled as a probability distribution over words in the vocabulary. Topic models are a form of unsupervised learning: the topics are not specified in advance and no training labels are provided. They emerge from patterns of word co-occurrence across the corpus, and a human reader inspects the high-probability words of each topic to assign an interpretive label such as "genetics", "foreign policy", or "deep learning".
The modern field begins with three papers that are now standard reading: the singular value decomposition approach of Deerwester, Dumais, Furnas, Landauer, and Harshman in Indexing by Latent Semantic Analysis (1990), the probabilistic mixture model of Hofmann in Probabilistic Latent Semantic Indexing (1999), and the fully Bayesian generative model of Blei, Ng, and Jordan in Latent Dirichlet Allocation (2003). LDA in particular has been cited tens of thousands of times and remains the canonical example of a probabilistic topic model. A more recent line of work (BERTopic, Top2Vec, the Embedded Topic Model) replaces the bag-of-words representation with neural embeddings while keeping the same basic goal of producing interpretable topic descriptions.
Given a corpus of D documents drawn from a vocabulary of V words, a topic model assumes a small number K of latent topics. Each topic k is described by a distribution φ_k over the V words, so a topic about machine learning might place high probability on "network", "training", "model", "learning", and "neural". Each document d is described by a distribution θ_d over the K topics, so an article about a new convolutional architecture might be 70 percent "machine learning", 20 percent "computer vision", and 10 percent "hardware". The observed words of the document are generated by repeatedly drawing a topic from θ_d and then drawing a word from the chosen topic's distribution φ_k.
Topic models are deliberately simple in two ways. First, they treat documents as a bag of words, throwing away word order and grammar. Second, they assume a fixed vocabulary and a fixed number of topics. These simplifications make inference tractable on millions of documents while still recovering coherent themes, which is why LDA-style models remain in production use after twenty years even though deep learning has otherwise eaten the surrounding field of natural language processing.
Latent Dirichlet Allocation (LDA), introduced by David Blei, Andrew Ng, and Michael Jordan in the Journal of Machine Learning Research in 2003, is the canonical topic model and the starting point for almost every later variant. LDA is a generative model: it specifies a probabilistic story for how each word in each document was produced.
The generative process for a corpus of D documents and K topics is as follows. For each topic k in 1..K, draw a word distribution φ_k from a Dirichlet prior with concentration parameter β. For each document d in 1..D, draw a topic distribution θ_d from a Dirichlet prior with concentration parameter α. Then for each of the N_d word positions in document d, draw a topic assignment z_n from a categorical distribution with parameters θ_d, and draw the observed word w_n from the categorical distribution φ_{z_n}. The only observed quantities are the words w_n; the topic assignments z, the per-document topic distributions θ, and the per-topic word distributions φ are all latent and must be inferred.
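The generative story above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a reference implementation: the function names, the toy sizes, and the use of symmetric Dirichlet priors are assumptions made for the example, and words are represented by integer ids rather than strings.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]

def sample_categorical(probs, rng):
    """Draw one index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(num_docs, num_topics, vocab_size, doc_length, alpha, beta, seed=0):
    """Sample a synthetic corpus following the LDA generative process."""
    rng = random.Random(seed)
    # One word distribution phi_k per topic, drawn from Dirichlet(beta).
    phi = [sample_dirichlet(beta, vocab_size, rng) for _ in range(num_topics)]
    theta, docs = [], []
    for _ in range(num_docs):
        # One topic distribution theta_d per document, drawn from Dirichlet(alpha).
        theta_d = sample_dirichlet(alpha, num_topics, rng)
        words = []
        for _ in range(doc_length):
            z = sample_categorical(theta_d, rng)   # latent topic assignment z_n
            w = sample_categorical(phi[z], rng)    # observed word id w_n
            words.append(w)
        theta.append(theta_d)
        docs.append(words)
    return phi, theta, docs

phi, theta, docs = generate_corpus(
    num_docs=5, num_topics=3, vocab_size=20, doc_length=50, alpha=0.5, beta=0.1
)
```

Only `docs` would be visible to an inference algorithm; `phi` and `theta` are returned here so that a fitted model can be compared against the values that generated the data.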
The Dirichlet prior is the workhorse of the model. It is the conjugate prior of the categorical distribution, which makes posterior updates tractable, and its concentration parameters control sparsity. A small α (less than one) pushes each document toward only a few topics, while a large α spreads probability across many topics. The same logic applies to β for the topic-word distribution. Posterior inference produces an estimate of the joint distribution over θ, φ, and z given the observed words.
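The effect of the concentration parameter is easy to check empirically. The sketch below, which assumes a symmetric Dirichlet over ten topics and a hypothetical helper `mean_largest_share`, averages the largest component of many draws: a value close to one means most of the mass sits on a single topic, while a value close to 1/K means the mass is spread evenly.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Draw from a symmetric Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]

def mean_largest_share(alpha, size=10, trials=500, seed=0):
    """Average, over many draws, of the largest probability in the sampled vector."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, size, rng)) for _ in range(trials)) / trials

sparse = mean_largest_share(alpha=0.1)    # small alpha: a few topics dominate
diffuse = mean_largest_share(alpha=10.0)  # large alpha: close to uniform

print(f"alpha=0.1  -> mean largest share {sparse:.2f}")
print(f"alpha=10.0 -> mean largest share {diffuse:.2f}")
```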
The development of topic models stretches across statistics, information retrieval, and machine learning, with each step adding a structural assumption or relaxing one. The Bayesian phase that LDA opened in 2003 produced more than a decade of variants, and the embedding phase that started around 2018 is still in progress.
| Year | Model | Authors | Contribution |
|---|---|---|---|
| 1990 | LSA / LSI | Deerwester, Dumais, Furnas, Landauer, Harshman | SVD of TF-IDF matrix; not probabilistic but the conceptual precursor |
| 1999 | pLSA / pLSI | Hofmann | Probabilistic mixture model trained with EM |
| 2003 | LDA | Blei, Ng, Jordan | Fully Bayesian generative model with Dirichlet priors |
| 2004 | Author-Topic Model | Rosen-Zvi, Griffiths, Steyvers, Smyth | Each author has a topic distribution; document is a mixture of authors |
| 2004 | Collapsed Gibbs sampling for LDA | Griffiths, Steyvers | Practical MCMC inference; Finding scientific topics in PNAS |
| 2006 | Dynamic Topic Models | Blei, Lafferty | Topics evolve over time via state-space model |
| 2006 | Pachinko Allocation (PAM) | Li, McCallum | DAG-structured correlations among topics |
| 2006 | Hierarchical Dirichlet Process | Teh, Jordan, Beal, Blei | Nonparametric prior; K inferred automatically |
| 2007 | Correlated Topic Model | Blei, Lafferty | Logistic-normal prior allows correlated topics |
| 2008 | Supervised LDA (sLDA) | Blei, McAuliffe | Document-level response variable guides topic discovery |
| 2010 | Online LDA | Hoffman, Blei, Bach | Online variational Bayes on mini-batches; scales to streaming corpora |
| 2013 | Stochastic Variational Inference | Hoffman, Blei, Wang, Paisley | General SVI framework for conjugate exponential families |
| 2020 | Embedded Topic Model (ETM) | Dieng, Ruiz, Blei | Combines LDA with word embeddings |
| 2020 | Top2Vec | Angelov | Joint embedding of words, documents, and topics |
| 2022 | BERTopic | Grootendorst | Transformer embeddings clustered with UMAP and HDBSCAN |
The LDA posterior is intractable in closed form because the topic assignments z and the continuous variables θ and φ are entangled, so practical implementations rely on approximate inference. The two dominant families are variational methods, which fit a tractable distribution to the true posterior, and sampling methods, which draw from it.
Variational EM was the original inference algorithm proposed by Blei, Ng, and Jordan in the 2003 paper. It introduces a fully factorised variational distribution over θ, φ, and z and minimises the KL divergence to the true posterior, which is equivalent to maximising a lower bound on the marginal likelihood, by coordinate ascent on that bound. An EM-like outer loop updates the corpus-level parameters.
Collapsed Gibbs sampling, described by Thomas Griffiths and Mark Steyvers in their 2004 PNAS paper Finding scientific topics, is the most widely used inference technique in practice. The trick is to integrate out θ and φ analytically using the conjugacy of the Dirichlet and the categorical, leaving a sampler that operates only on the discrete topic assignments z. The conditional probability that word n in document d is assigned to topic k depends on the current count of how often word w_n has been assigned to topic k across the corpus and how often topic k has been used in document d. Each iteration sweeps through every word and resamples its topic. The MALLET implementation built by Andrew McCallum and collaborators is widely regarded as the gold-standard collapsed Gibbs sampler.
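A compact sketch of such a sampler is shown below. It assumes symmetric priors, integer word ids, and a tiny hand-made corpus, and it uses the usual conditional in which the probability of topic k for a token is proportional to (n_dk + α)(n_kw + β) / (n_k + Vβ), with all counts excluding the token being resampled. The function name and defaults are illustrative; production samplers such as MALLET add many optimisations that are omitted here.

```python
import random

def collapsed_gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
                        iterations=50, seed=0):
    """Toy collapsed Gibbs sampler for LDA over documents given as lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # n_dk
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # n_kw
    topic_total = [0] * num_topics                               # n_k
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][n]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][j] + alpha) * (topic_word[j][w] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(num_topics)
                ]
                # Resample the topic and add the token back.
                k = rng.choices(range(num_topics), weights=weights, k=1)[0]
                assignments[d][n] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Point estimates of phi and theta from the final counts.
    phi = [[(topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
            for w in range(vocab_size)] for k in range(num_topics)]
    theta = [[(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
              for k in range(num_topics)] for d, doc in enumerate(docs)]
    return phi, theta, assignments

# Toy corpus: three documents use word ids 0-3, three use word ids 4-7.
docs = [[0, 1, 2, 3] * 3, [0, 1, 1, 2] * 3, [3, 2, 1, 0] * 3,
        [4, 5, 6, 7] * 3, [5, 5, 6, 7] * 3, [7, 6, 5, 4] * 3]
phi, theta, assignments = collapsed_gibbs_lda(docs, num_topics=2, vocab_size=8)
```

The estimates here come from a single final sample. Averaging over several well-separated samples is a common refinement.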
Online variational Bayes, introduced by Matthew Hoffman, David Blei, and Francis Bach in their 2010 NeurIPS paper Online Learning for Latent Dirichlet Allocation, processes mini-batches of documents and applies a stochastic gradient step on the variational parameters. The 2010 paper showed convergence on 3.3 million Wikipedia articles in a single pass, where batch variational inference would have taken days for one iteration. This algorithm later generalised into stochastic variational inference (Hoffman, Blei, Wang, and Paisley, JMLR 2013), which extends the same trick to a wide class of conjugate exponential family models.
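The stochastic step can be written compactly. The notation below is an assumption made for this sketch rather than a quotation from the paper: λ_kw is the variational Dirichlet parameter for word w in topic k, B_t is the mini-batch processed at step t, n_dw is the count of word w in document d, r_dwk is the variational probability that word w in document d is assigned to topic k, D is the corpus size, and β is the topic-word concentration used earlier in this article.

```latex
\begin{align}
\rho_t &= (\tau_0 + t)^{-\kappa}, \qquad \kappa \in (0.5, 1],\quad \tau_0 \ge 0 \\
\hat{\lambda}_{kw} &= \beta + \frac{D}{|B_t|} \sum_{d \in B_t} n_{dw}\, r_{dwk} \\
\lambda_{kw} &\leftarrow (1 - \rho_t)\,\lambda_{kw} + \rho_t\,\hat{\lambda}_{kw}
\end{align}
```

The intermediate estimate treats the mini-batch as if it were the whole corpus, scaled up by D / |B_t|, and the decaying step size ρ_t blends it into the running value of λ so that early batches are gradually forgotten.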
For users who want full Bayesian uncertainty rather than point estimates, probabilistic programming languages like Stan and PyMC support Hamiltonian Monte Carlo or NUTS samplers on the LDA posterior, although these are far slower than collapsed Gibbs.
Three numbers govern the behaviour of an LDA model and must be chosen by the practitioner. The number of topics K is the most consequential. A model with K equal to ten will produce broad, sometimes vague topics, while K equal to several hundred will yield narrow, sometimes overlapping topics. The Hierarchical Dirichlet Process (Teh, Jordan, Beal, and Blei, JASA 2006) sidesteps this by placing a nonparametric prior that lets K grow with the corpus, but in practice many users still pick K by hand or by sweeping over candidate values and inspecting topic coherence.
The document-topic concentration α controls how sparse each document's topic distribution is. Values below one push each document to use only a handful of topics. Common defaults are 50/K and 1/K; note that 50/K falls below one only when K exceeds 50. The topic-word concentration β has the same effect on the per-topic word distribution: small β yields topics dominated by a few characteristic words, large β yields more diffuse topics. Some implementations, including MALLET and Hoffman's online LDA, learn α and β automatically rather than fixing them.
A fitted topic model produces two main artefacts. The first is the topic-word matrix φ, where each row is a probability distribution over the vocabulary. The conventional way to inspect a topic is to print its top ten or twenty words. A topic with top words neural, network, learning, deep, training, layer, gradient, model, weights, loss is recognisably about deep learning even though the model has no idea what the phrase "deep learning" means. The second is the document-topic matrix θ, which gives each document's mixture proportions. These are the features most often used in downstream pipelines, for clustering, retrieval, or as inputs to a classifier.
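Reading off the top words of each topic is a simple sort over the rows of φ. The snippet below is a small sketch with a made-up six-word vocabulary and a hand-written φ matrix; both are illustrative values, not output from a real model.

```python
def top_words(phi, vocabulary, n=3):
    """Return the n highest-probability words for each topic (row) of phi."""
    result = []
    for row in phi:
        ranked = sorted(range(len(row)), key=lambda w: row[w], reverse=True)
        result.append([vocabulary[w] for w in ranked[:n]])
    return result

# Hypothetical vocabulary and topic-word matrix; each row sums to one.
vocabulary = ["neural", "network", "gene", "dna", "training", "protein"]
phi = [
    [0.30, 0.25, 0.02, 0.03, 0.35, 0.05],
    [0.03, 0.02, 0.35, 0.30, 0.05, 0.25],
]

labels = top_words(phi, vocabulary)
for k, words in enumerate(labels):
    print(f"topic {k}: {', '.join(words)}")
```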
Topic labels are assigned by humans. The model does not name its topics, and a typical analysis project includes a reading session where domain experts look at the top words of each topic and write a short label. Some topics are clear; others are uninterpretable junk. Tools like LDAvis (Sievert and Shirley, 2014) help by showing the marginal topic distribution and the distinctiveness of each topic's vocabulary.
Topic coherence is the standard quantitative metric for topic quality. Newman et al. (2010) proposed scoring the top words of a topic by their pointwise mutual information, estimated from an external corpus such as Wikipedia; a normalised variant (NPMI) later became common. David Mimno, Hanna Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum followed up in 2011 with Optimizing Semantic Coherence in Topic Models, which showed that an internal coherence metric computed only from the training corpus is a strong proxy for human judgement and can be used to identify and discard bad topics.
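An internal coherence score of the kind described by Mimno et al. can be sketched from document co-occurrence counts alone. The version below sums log((D(v_m, v_l) + 1) / D(v_l)) over ordered pairs of a topic's top words, where D(.) counts the training documents containing the given word or words. The function name and the toy corpus are assumptions for the example, and the code assumes every top word occurs in at least one document.

```python
import math

def umass_coherence(top_words, documents):
    """Co-document-frequency coherence over a topic's top words, most probable word first."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / doc_freq(top_words[l]))
    return score

# Toy training corpus of tokenised documents.
documents = [
    ["neural", "network", "training"],
    ["neural", "network"],
    ["gene", "dna"],
    ["gene", "protein", "dna"],
]

coherent = umass_coherence(["neural", "network", "training"], documents)
mixed = umass_coherence(["neural", "gene", "dna"], documents)
print(coherent, mixed)
```

Higher (less negative) scores indicate that a topic's top words tend to appear in the same documents, so the mixed topic scores lower than the coherent one on this corpus.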
Topic models are popular in any setting where someone has a large pile of unlabelled text or other count data and wants a quick map of the dominant themes. They are also useful when interpretability matters more than raw predictive accuracy, since each document is described by a mixture of named topics rather than an opaque embedding.
| Application area | Typical use |
|---|---|
| Document exploration | Mapping the themes in an unfamiliar corpus |
| Document clustering | Group documents by dominant topic |
| Information retrieval | Topic-based ranking, query expansion |
| Recommender systems | Match users to items via topic preferences |
| Trend analysis | Track how topics rise and fall over time using a Dynamic Topic Model |
| Author analysis | Profile authors by their topic mixtures (author-topic models) |
| Scientific literature | Griffiths and Steyvers analysed PNAS abstracts; Blei and Lafferty did the same for Science |
| News organisation | Cluster headlines, surface emerging stories |
| Customer reviews | Extract themes from product or restaurant reviews |
| Social media | Detect topics in tweets and forum posts |
| Genomics | Topics over gene-expression data approximate biological pathways |
| Computer vision | Bag-of-visual-words topic models for image categorisation |
| Music | Latent genre discovery from playlists and tags |
Topic features (the θ vector for each document) are also commonly used as inputs to a downstream supervised classifier. This was a standard pre-deep-learning recipe for text classification: train an LDA model on the corpus, then train a logistic regression on the resulting topic vectors.
Topic modelling sits at the intersection of dimensionality reduction, mixture models, and matrix factorisation. The same basic data, a document-by-word count matrix, can be decomposed in many different ways, each with its own assumptions and trade-offs.
| Technique | Probabilistic? | Handles new documents | Notes |
|---|---|---|---|
| LSA / LSI (1990) | No | Yes (project new doc) | SVD of TF-IDF; the earliest latent-space approach |
| pLSA (1999) | Yes (mixture only) | No (no doc-level prior) | Trained with EM; parameter count grows with the number of training documents, so it tends to overfit |
| LDA (2003) | Yes (full Bayesian) | Yes | Dirichlet priors on doc-topic and topic-word |
| NMF | No | Yes (project) | Non-negative matrix factorisation; fast deterministic alternative |
| HDP (2006) | Yes (nonparametric) | Yes | Like LDA but K is inferred automatically |
| CTM (2007) | Yes | Yes | Logistic-normal allows topic correlations |
| Top2Vec (2020) | No (embedding) | Yes | Joint embedding of words, docs, topics |
| BERTopic (2022) | No (embedding) | Yes | UMAP + HDBSCAN on transformer embeddings; uses class-based TF-IDF for topic words |
| Word2Vec / contextual embeddings | No | Yes | Word-level, not a doc-level mixture; not a topic model in the strict sense |
| LLM topic discovery | No | Yes | Prompt a large language model like GPT-4 to extract topics |
LSA, the 1990 precursor, applies a singular value decomposition to a TF-IDF matrix and keeps the top k singular components. It is fast and deterministic but has no probabilistic interpretation, so quantities like "the probability of word w under topic k" do not make sense. pLSA fixed this by introducing a multinomial mixture but lacked a prior over the per-document topic distribution, which meant the model could not assign topic mixtures to documents it had not seen during training and tended to overfit. LDA closed both gaps with the Dirichlet priors and became the dominant model.
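The LSA step itself is a truncated SVD. The sketch below assumes NumPy is available and uses a tiny hand-made count matrix with a simple TF-IDF weighting (term frequency times log(N / document frequency)); the weighting variant and the choice of k are illustrative assumptions.

```python
import numpy as np

# Toy document-by-term count matrix (rows: documents, columns: terms).
counts = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

# Simple TF-IDF weighting: term frequency times log(N / document frequency).
doc_freq = (counts > 0).sum(axis=0)
idf = np.log(counts.shape[0] / doc_freq)
tfidf = counts * idf

# Truncated SVD: keep only the k largest singular components.
U, S, Vt = np.linalg.svd(tfidf, full_matrices=False)
k = 2
doc_vectors = U[:, :k] * S[:k]   # documents in the k-dimensional latent space
term_vectors = Vt[:k, :].T       # terms in the same latent space
approx = doc_vectors @ term_vectors.T  # rank-k approximation of tfidf

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same_theme = cosine(doc_vectors[0], doc_vectors[1])
different_theme = cosine(doc_vectors[0], doc_vectors[2])
```

The latent coordinates can be negative and do not sum to one, which is why they cannot be read as probabilities in the way that θ and φ can.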
Non-negative matrix factorisation (NMF) decomposes the document-word matrix into two non-negative matrices. The result looks very similar to a topic model, with one matrix giving topic proportions per document and the other giving word weights per topic, and on many corpora NMF and LDA produce comparable topics. NMF is faster and simpler to implement but has no probabilistic interpretation and no natural way to handle new documents in a streaming setting.
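One classical way to compute such a factorisation is with multiplicative updates for the squared reconstruction error, in the style of Lee and Seung. The sketch below assumes NumPy, a small toy count matrix, and a hypothetical helper named `nmf`; library implementations typically use different solvers and initialisation schemes.

```python
import numpy as np

def nmf(V, k, iterations=200, seed=0, eps=1e-10):
    """Factorise a non-negative matrix V (docs x terms) as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    W = rng.random((n_docs, k)) + eps   # per-document topic weights
    H = rng.random((k, n_terms)) + eps  # per-topic word weights
    for _ in range(iterations):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy document-by-term counts with two obvious word groups.
V = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

W, H = nmf(V, k=2)
error = np.linalg.norm(V - W @ H)
```

Rows of `H` play the role of topics and rows of `W` the role of document mixtures, although neither is normalised to sum to one.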
| Library | Language | Notes |
|---|---|---|
| Gensim | Python | LDA, online LDA, HDP, NMF, MALLET wrapper; the most popular Python topic modelling library |
| scikit-learn | Python | LatentDirichletAllocation (online VB) and NMF |
| MALLET | Java | Gold-standard collapsed Gibbs sampler from Andrew McCallum's group at UMass |
| BERTopic | Python | Maarten Grootendorst's transformer-based topic model |
| Top2Vec | Python | Dimo Angelov's joint document, word, topic embedding |
| tomotopy | C++ with Python bindings | Fast Gibbs sampling; supports many LDA variants |
| Stan / PyMC | Python, R | Full Bayesian inference for users who want posterior uncertainty |
| MALLET via R mallet | R | Bridge for R users to MALLET |
| stm | R | Structural Topic Model used in social science |
| ETM | Python | Reference implementation of Dieng et al. 2020 Embedded Topic Model |
MALLET, written in the early 2000s and maintained by McCallum's group, is still cited as the most accurate practical implementation. Gensim, written by Radim Rehurek, is the most widely deployed because of its Python ergonomics, online training, and integration with the rest of the Python data stack.
Topic models are unusually interpretable for a machine learning method. A trained model can be summarised by printing the top words of each topic, and a domain expert can usually agree or disagree with each cluster on inspection. They are unsupervised, so they work on raw text without annotation. They scale to corpora of millions of documents thanks to online VB and Gibbs sampling. The probabilistic formulation is theoretically principled and lends itself to extensions: dynamic topic models, supervised LDA, author-topic models, and the embedded topic model are all small modifications of the same core graphical model.
This interpretability is the main reason topic models survive in 2026 despite the dominance of transformer representations. A neural document embedding from an LLM gives better predictive performance on most downstream tasks, but it is hard to look at one and explain to a non-technical reader what the model has learned. Topic models give you a vocabulary you can argue about.
The bag-of-words assumption discards word order and syntax, which is a real cost on tasks where phrasing matters. The need to choose K is a perpetual annoyance, partly addressed by HDP and partly by sweeping over coherence. Topic quality varies: even on a clean corpus a fitted LDA model often contains a handful of uninterpretable junk topics that mostly capture stop-word patterns or scanning artefacts. Hyperparameter sensitivity is real, and small changes to α, β, the vocabulary cutoff, or the random seed can produce noticeably different topics.
For very large corpora, batch inference is slow. Online LDA helps, but extreme-scale text analysis has largely moved to neural embedding methods. For downstream prediction tasks, fine-tuned BERT or T5 models almost always beat LDA features. Topic models are rarely competitive when raw accuracy is the only goal.
Despite the rise of transformer language models, topic models remain useful in three settings. First, exploratory analysis of unfamiliar corpora, where interpretable themes matter more than predictive accuracy. A historian, a journalist, or a market researcher with a million documents and no labels still reaches for LDA or BERTopic. Second, social science and digital humanities, where the field has built methodology and norms around topic-model output, and where the interpretability of the topics is the analytical product. Third, content analysis pipelines at companies that need a stable, low-cost summary of incoming text such as customer reviews or support tickets.
The modern hybrid approach, embodied by BERTopic and Top2Vec, replaces the bag-of-words representation with sentence-transformer embeddings and replaces the Dirichlet generative model with density-based clustering on the embedding space. BERTopic uses sentence-transformer embeddings, reduces dimensionality with UMAP, clusters with HDBSCAN, and then describes each cluster with a class-based TF-IDF over the documents in that cluster. The result feels like a topic model to a user, complete with top-word lists per topic, but the underlying machinery is closer to clustering than to Bayesian inference. The same general idea drives Top2Vec.
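The final, cluster-describing step can be illustrated without any embedding machinery. The sketch below assumes the clusters have already been found and scores each term in a cluster as its in-cluster frequency times log(1 + A / f), where f is the term's frequency across all clusters and A is the average number of tokens per cluster. This is a simplified reading of class-based TF-IDF written for this example; the function name is hypothetical and the BERTopic library's own implementation may differ in details such as normalisation.

```python
import math
from collections import Counter

def class_tfidf(clusters):
    """clusters maps cluster id -> list of tokenised documents; returns id -> {term: weight}."""
    # Treat each cluster as one long document and count its terms.
    class_counts = {
        c: Counter(tok for doc in docs for tok in doc)
        for c, docs in clusters.items()
    }
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    avg_tokens = sum(total_counts.values()) / len(class_counts)
    return {
        c: {t: tf * math.log(1 + avg_tokens / total_counts[t]) for t, tf in counts.items()}
        for c, counts in class_counts.items()
    }

# Toy clusters, assumed to come from an earlier embedding and clustering stage.
clusters = {
    0: [["neural", "network", "the"], ["network", "training", "the"]],
    1: [["gene", "dna", "the"], ["dna", "protein", "the"]],
}

weights = class_tfidf(clusters)
top_terms = {c: max(w, key=w.get) for c, w in weights.items()}
print(top_terms)
```

Terms that are frequent inside one cluster but rare elsewhere receive the highest weights, while a term such as "the" that is spread across every cluster is discounted.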
In the broader machine learning curriculum, LDA is still taught as the canonical example of a hierarchical Bayesian model with latent variables and conjugate priors. The graphical model in plate notation appears in nearly every introduction to probabilistic machine learning.
From roughly 2023 onward, several lines of work have emerged that combine topic models with large language models. One is using an LLM as a coherence judge, replacing the older NPMI metric with prompts that ask the model to rate the interpretability of a top-word list. A related line uses LLMs to label topics automatically, turning a list of top words into a short human-readable description. A third uses LLMs as a topic discovery method in their own right, prompting the model to read a sample of documents and produce a topic list directly, sometimes followed by a topic assignment step on the rest of the corpus. These approaches sometimes outperform LDA on coherence and labelling but raise familiar issues of cost, reproducibility, and prompt sensitivity.
Classical topic models continue to dominate in social science and humanities, where the field values reproducibility, interpretability, and the ability to publish a model that other researchers can refit on their own corpora. Several survey papers from 2024 and 2025 argue for hybrid pipelines: use BERTopic or an LLM for initial topic discovery, then validate and refine with classical coherence metrics and human review.