TF-IDF, short for term frequency-inverse document frequency, is a numerical statistic that reflects how important a word is to a document within a larger collection or corpus. It is the product of two factors. The first, term frequency (TF), counts how often a word appears in a single document and rewards repeated usage as a signal of topical relevance. The second, inverse document frequency (IDF), discounts words that occur in many documents because they tend to be generic and uninformative. Multiplying these two quantities yields a score that is high for words that are both frequent in a particular document and rare in the corpus as a whole. TF-IDF is one of the oldest and most widely used weighting schemes in natural language processing and information retrieval, and despite the rise of dense neural representations it remains a default baseline for ranking, classification, and feature engineering.
The scheme has a long pedigree. The notion of inverse document frequency was introduced by Karen Sparck Jones in a 1972 paper titled A Statistical Interpretation of Term Specificity and Its Application in Retrieval in the Journal of Documentation. Gerard Salton and Christopher Buckley consolidated the term-weighting literature in their 1988 paper Term-weighting approaches in automatic text retrieval in Information Processing and Management, where they catalogued the variants of TF and IDF and established empirical guidelines that are still cited decades later. Salton's earlier vector space model, implemented in the SMART system at Cornell, paired TF-IDF weights with cosine similarity and became the dominant ranking framework for a generation of search systems. The classical reference textbook is Christopher Manning, Prabhakar Raghavan, and Hinrich Schutze's Introduction to Information Retrieval (Cambridge University Press, 2008), whose chapter on scoring devotes substantial space to TF-IDF and its SMART notation.
Let a corpus consist of $N$ documents indexed by $d$, and let $t$ denote a term (typically a word, but possibly a phrase or character n-gram). Two basic statistics are defined.
Term frequency $\text{tf}(t, d)$ counts the number of times term $t$ occurs in document $d$. The simplest definition is the raw count, but several normalizations are common in practice.
Document frequency $\text{df}(t)$ counts the number of documents in the corpus that contain term $t$ at least once. The inverse document frequency is then defined as
$$\text{idf}(t) = \log \frac{N}{\text{df}(t)}$$
where the logarithm can be taken to any base (base 2, base 10, and the natural logarithm are all common); the choice is immaterial because it rescales all weights uniformly. The intuition is information-theoretic: if a term occurs in only a small fraction of documents, observing it in a query carries a great deal of evidence about which document is intended, whereas a term that occurs in nearly every document carries almost none.
The TF-IDF weight is the product
$$\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \text{idf}(t).$$
A document is then represented as a vector whose components are the TF-IDF weights of all terms in the vocabulary. Most vocabulary entries do not appear in any given document, so these vectors are extremely sparse and are stored as lists of (term index, weight) pairs.
Consider a corpus of $N = 1{,}000{,}000$ news articles. The word the appears in essentially all of them, so $\text{df}(\text{the}) \approx 1{,}000{,}000$ and $\text{idf}(\text{the}) = \log(10^6 / 10^6) = 0$. The word transformer might appear in $5{,}000$ documents, giving $\text{idf}(\text{transformer}) = \log(10^6 / 5{,}000) \approx 7.64$ in base 2 or $2.30$ in base 10. The word embiggens might appear in only $10$ documents, giving an IDF of $16.6$ (base 2). A document that mentions transformer five times receives a TF-IDF score of $5 \cdot 7.64 = 38.2$ for that term, whereas mentioning the five times still scores $0$. The function thus elevates rare, content-bearing words while suppressing frequent function words.
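As a quick check, a few lines of Python reproduce these numbers; this is a minimal sketch using the hypothetical corpus counts above.

```python
import math

N = 1_000_000                      # corpus size from the example above

def idf(df, base=2):
    """Inverse document frequency for a term with document frequency df."""
    return math.log(N / df, base)

print(idf(1_000_000))              # 'the'         -> 0.0
print(idf(5_000))                  # 'transformer' -> ~7.64 (base 2)
print(idf(5_000, base=10))         # 'transformer' -> ~2.30 (base 10)
print(idf(10))                     # 'embiggens'   -> ~16.61 (base 2)
print(5 * idf(5_000))              # tf-idf for five mentions -> ~38.2
```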
The basic recipe admits many variations, and the choice of variant matters for retrieval and classification accuracy. The standard SMART notation, introduced by Salton and refined by Buckley, encodes a TF-IDF scheme as a triplet of letters: the term frequency variant, the document frequency variant, and the normalization. A document weighting and a query weighting are then written together as ddd.qqq. The following table summarizes the most common choices.
| Variant | Symbol | Formula | Rationale |
|---|---|---|---|
| Raw count | n (natural) | $\text{tf}(t, d)$ | Simplest baseline, sensitive to document length |
| Boolean | b (boolean) | $1$ if term occurs, $0$ otherwise | Ignores how often a term repeats |
| Logarithmic | l (log) | $1 + \log(\text{tf}(t, d))$ if $\text{tf} > 0$, else $0$ | Prevents very frequent terms from dominating |
| Augmented | a (augmented) | $0.5 + 0.5 \cdot \text{tf}(t, d) / \max_{t'} \text{tf}(t', d)$ | Normalizes by the most frequent term in the document, useful for long documents |
| Double normalization K | (variant of a) | $K + (1 - K) \cdot \text{tf}(t, d) / \max_{t'} \text{tf}(t', d)$ | Generalization of augmented TF with adjustable floor |
| Sublinear (sklearn) | (similar to l) | $1 + \ln(\text{tf}(t, d))$ | Implemented in scikit-learn as sublinear_tf=True |
For the document frequency factor, the variants are fewer but equally important.
| Variant | Formula | Rationale |
|---|---|---|
| No IDF (n) | $1$ | Treat all terms equally; rarely useful in retrieval |
| Standard IDF (t) | $\log(N / \text{df}(t))$ | The original Sparck Jones formulation |
| Smoothed IDF | $\log((1 + N) / (1 + \text{df}(t))) + 1$ | Avoids division by zero, used by scikit-learn |
| Probabilistic IDF (p) | $\max(0, \log((N - \text{df}(t)) / \text{df}(t)))$ | Derived from the Robertson-Sparck Jones probabilistic model; clipped to zero for very common terms |
| Maximum IDF | $\log((1 + \max_t \text{df}(t)) / (1 + \text{df}(t)))$ | Normalizes against the most frequent term in the corpus |
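The TF and IDF variants above are straightforward to implement directly. The following is a minimal sketch of a few of them in plain Python; the function names are chosen for this illustration and do not come from any library.

```python
import math

def tf_log(tf):                          # 'l': logarithmic term frequency
    return 1 + math.log(tf) if tf > 0 else 0.0

def tf_augmented(tf, max_tf, k=0.5):     # 'a': augmented (double normalization K for general k)
    return k + (1 - k) * tf / max_tf

def idf_standard(N, df):                 # 't': the Sparck Jones formulation
    return math.log(N / df)

def idf_smoothed(N, df):                 # the smoothed variant used by scikit-learn
    return math.log((1 + N) / (1 + df)) + 1

def idf_probabilistic(N, df):            # 'p': Robertson-Sparck Jones, clipped at zero
    if df >= N:
        return 0.0
    return max(0.0, math.log((N - df) / df))
```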
Finally, vectors are usually normalized to control for differences in document length, since longer documents otherwise accumulate larger vectors and would dominate cosine similarity comparisons.
| Normalization | Formula | Rationale |
|---|---|---|
| None (n) | leave vector as is | Sensitive to length |
| Cosine (c) | divide by the Euclidean norm $\lVert v \rVert_2$ | Equivalent to using cosine similarity; the standard choice |
| L1 | divide by the $\ell_1$ norm $\lVert v \rVert_1$ | Useful when interpreting weights as a probability distribution |
| Pivoted unique (u) | divide by $(1 - s) \cdot \text{pivot} + s \cdot u_d$, where $u_d$ is the number of unique terms in $d$ | Corrects the length bias of cosine normalization (Singhal, Buckley, and Mitra, 1996) |
A classical pairing in IR practice is the SMART code lnc.ltc. Documents use logarithmic TF, no IDF, and cosine normalization (because IDF only matters for the query side in a one-sided weighting scheme). Queries use logarithmic TF, standard IDF, and cosine normalization. This combination is recommended in Manning, Raghavan, and Schutze and is the default setting for many academic IR baselines.
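To show how the SMART codes translate into code, here is a minimal sketch of lnc.ltc scoring: log TF on both sides, IDF on the query only, cosine normalization throughout. The helper names lnc, ltc, and score are invented for this illustration.

```python
import math
from collections import Counter

def lnc(doc_tokens):
    """Document weights: log TF, no IDF, cosine normalization."""
    tf = Counter(doc_tokens)
    w = {t: 1 + math.log(c) for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
    return {t: v / norm for t, v in w.items()}

def ltc(query_tokens, N, df):
    """Query weights: log TF, standard IDF, cosine normalization."""
    tf = Counter(query_tokens)
    w = {t: (1 + math.log(c)) * math.log(N / df[t]) for t, c in tf.items() if t in df}
    norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
    return {t: v / norm for t, v in w.items()}

def score(query_w, doc_w):
    """Cosine similarity of two normalized sparse vectors is a dot product over shared terms."""
    return sum(qv * doc_w.get(t, 0.0) for t, qv in query_w.items())
```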
TF-IDF is most often used inside the vector space model of information retrieval, introduced by Salton, Wong, and Yang in 1975. In this model, both documents and queries are mapped to vectors in a high-dimensional vocabulary space, with each axis corresponding to a unique term. The relevance of a document to a query is computed as the cosine of the angle between the document vector $\vec{d}$ and the query vector $\vec{q}$:
$$\text{sim}(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|_2 \cdot |\vec{d}|_2}.$$
If both vectors are L2-normalized, this reduces to the dot product, which is fast to compute over sparse vectors using inverted indexes. Cosine similarity is invariant to vector scaling, so it focuses on direction (which terms appear together) rather than magnitude (how long the document is). For TF-IDF vectors, the cosine measures the overlap of distinctive vocabulary, weighted by how rare each shared term is in the corpus. This pairing of TF-IDF weights with cosine similarity over an inverted index was the workhorse of search engines from the 1970s through the early 2000s, and modern engines like Apache Lucene, Solr, and Elasticsearch retained TF-IDF scoring as their default ranking function for many years before switching to BM25 around 2015 to 2016.
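Because scikit-learn's TfidfVectorizer (covered in more detail below) L2-normalizes rows by default, ranking documents against a query reduces to a sparse dot product. A minimal sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

docs = ["the cat sat on the mat", "dogs are loyal companions"]
vectorizer = TfidfVectorizer()                 # L2-normalizes each row by default
D = vectorizer.fit_transform(docs)             # sparse document-term matrix
q = vectorizer.transform(["loyal dogs"])       # query mapped into the same space

# With unit-length rows, the dot product equals cosine similarity.
scores = linear_kernel(q, D).ravel()
ranking = scores.argsort()[::-1]
print(ranking, scores[ranking])
```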
BM25, formally Okapi BM25, is a probabilistic ranking function developed by Stephen Robertson, Karen Sparck Jones, and colleagues at City University London in the 1980s and 1990s. It is best understood as a refined cousin of TF-IDF that fixes two well known weaknesses of the basic formula. The BM25 score for a document $D$ given a query $Q$ is
$$\text{score}(D, Q) = \sum_{t \in Q} \text{idf}(t) \cdot \frac{\text{tf}(t, D) \cdot (k_1 + 1)}{\text{tf}(t, D) + k_1 \cdot (1 - b + b \cdot |D| / \text{avgdl})}$$
where $k_1$ controls term frequency saturation (typically $1.2$ to $2.0$), $b$ controls length normalization (typically $0.75$), and $\text{avgdl}$ is the average document length in the corpus. The IDF term is usually the probabilistic variant $\log((N - \text{df}(t) + 0.5) / (\text{df}(t) + 0.5))$, which can be negative for very common terms and is sometimes clipped to zero.
The two improvements over TF-IDF are easy to read off the formula. First, term frequency saturates: as $\text{tf}(t, D)$ grows, the contribution of term $t$ approaches an asymptote of $\text{idf}(t) \cdot (k_1 + 1)$, capturing the intuition that the tenth occurrence of transformer in a paper is much less informative than the first. Second, document length is normalized smoothly via the $b$ parameter, so a long document is not penalized as harshly as it would be by raw TF and not entirely free of length penalty either. The following table contrasts the two methods on the dimensions practitioners care about.
| Property | TF-IDF | BM25 |
|---|---|---|
| TF response | Linear (or logarithmic with sublinear TF) | Saturating, with parameter $k_1$ |
| Document length handling | None unless cosine normalized | Built in via parameter $b$ |
| IDF formulation | $\log(N / \text{df}(t))$ | $\log((N - \text{df} + 0.5) / (\text{df} + 0.5))$, possibly clipped |
| Tunable parameters | Few or none | $k_1$, $b$ |
| Theoretical grounding | Heuristic, vector space model | Probabilistic relevance model (Binary Independence) |
| Default in modern engines | Less common since around 2015 | Default in Lucene, Solr, Elasticsearch |
| Typical accuracy | Strong baseline | Usually slightly to moderately better than TF-IDF |
BM25 is now the default lexical ranking function in nearly every modern open source search engine, and TF-IDF is largely retained as a feature representation for downstream models rather than a standalone ranker. The two methods are close kin, however, and often produce similar rankings on short queries with rare terms.
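A minimal BM25 scorer that follows the formula above is only a few lines. This is a sketch rather than a production implementation: it assumes precomputed document frequencies (df, a dict from term to count), the corpus size N, and the average document length avgdl, and it clips negative IDF values to zero as described.

```python
import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, df, N, avgdl, k1=1.2, b=0.75):
    """Score one document against a query with Okapi BM25."""
    tf = Counter(doc_tokens)
    dl = len(doc_tokens)
    score = 0.0
    for t in query_tokens:
        if t not in tf or t not in df:
            continue
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5))
        idf = max(idf, 0.0)                      # clip negative IDF for very common terms
        denom = tf[t] + k1 * (1 - b + b * dl / avgdl)
        score += idf * tf[t] * (k1 + 1) / denom
    return score
```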
In the Python ecosystem, the canonical implementation is sklearn.feature_extraction.text.TfidfVectorizer. It combines tokenization, vocabulary building, and TF-IDF weighting into a single transformer with sensible defaults. A typical use looks like this:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    "The cat sat on the mat",
    "Cats are excellent climbers",
    "Dogs are loyal companions",
]
# Learn the vocabulary and IDF weights, then build the sparse TF-IDF matrix.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())
The vectorizer returns a sparse matrix of shape (n_documents, n_features). Several constructor parameters control the variant in use. The following table covers the most important ones.
| Parameter | Default | Effect |
|---|---|---|
| lowercase | True | Convert all text to lowercase before tokenizing |
| stop_words | None | Optionally remove a list of common words such as English function words |
| ngram_range | (1, 1) | Range of n-grams to extract; (1, 2) adds bigrams |
| max_df | 1.0 | Ignore terms that appear in more than this fraction of documents |
| min_df | 1 | Ignore terms that appear in fewer than this many documents |
| max_features | None | Cap vocabulary size at the top-N most frequent terms |
| norm | 'l2' | Vector normalization: 'l1', 'l2', or None |
| use_idf | True | Whether to apply the IDF factor |
| smooth_idf | True | Use $\log((1 + N) / (1 + \text{df})) + 1$ to avoid divide-by-zero |
| sublinear_tf | False | Replace $\text{tf}$ with $1 + \ln(\text{tf})$ when set to True |
| analyzer | 'word' | Switch to 'char' or 'char_wb' for character n-grams |
| token_pattern | r"(?u)\b\w\w+\b" | Regular expression that defines what counts as a token; the default keeps tokens of two or more word characters |
A few practical notes follow from these defaults. First, scikit-learn's IDF formula differs slightly from the textbook by adding 1 to both numerator and denominator (smoothing) and adding 1 to the result, so a term that appears in every document has IDF $1$ rather than $0$. Combined with L2 normalization, this means common terms still contribute, just less than rare ones. Second, sublinear_tf=True is recommended for long documents because it dampens the effect of repeated terms much like BM25 does. Third, fitting the vectorizer on the training set and only transforming the test set is essential to avoid leaking test-set vocabulary statistics into the model.
The related TfidfTransformer accepts a precomputed term-count matrix (for instance from CountVectorizer) and applies the IDF and normalization steps. Splitting these into two stages is convenient when the same counts feed multiple downstream models with different weighting schemes.
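A sketch of the two-stage pipeline, including the fit/transform split that keeps test-set statistics out of the learned IDF values. The document lists here are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

train_docs = ["the cat sat on the mat", "cats are excellent climbers"]
test_docs = ["dogs are loyal companions"]

counts = CountVectorizer()
tfidf = TfidfTransformer(sublinear_tf=True)     # 1 + ln(tf); smoothed IDF by default

# Fit on training documents only; the test set is merely transformed.
X_train = tfidf.fit_transform(counts.fit_transform(train_docs))
X_test = tfidf.transform(counts.transform(test_docs))
```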
For decades, the standard pipeline for text classification was to convert documents to TF-IDF vectors and feed them into a linear classifier such as a support vector machine (SVM), logistic regression, or multinomial naive Bayes. This combination is fast, interpretable, and surprisingly competitive on tasks like topic classification, sentiment analysis, spam detection, and authorship attribution. Thorsten Joachims's 1998 paper Text Categorization with Support Vector Machines established the linear SVM with TF-IDF features as the baseline that any new method had to beat, a status it retained for nearly twenty years until neural networks finally surpassed it on most benchmarks.
The pairing works for several reasons. TF-IDF vectors are high-dimensional and sparse, a regime in which linear models with appropriate regularization are statistically efficient and computationally cheap. The feature representation is interpretable: examining the largest positive and negative weights of a logistic regression trained on TF-IDF features tells you which words are most predictive of each class. The model also trains in seconds on millions of documents, which is essential for iterative experimentation. Spam filters in mail systems through the 2000s and 2010s relied heavily on TF-IDF features fed into naive Bayes or SVM classifiers, often updated continuously as new spam patterns appeared. Even today, many production text-classification systems use TF-IDF features as a fast first-stage model, with neural reranking applied only to ambiguous cases.
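A minimal version of that classic pipeline with scikit-learn; the documents and labels below are toy placeholders for a spam-vs-ham task.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["win a free prize now", "meeting moved to 3pm", "cheap pills online", "lunch tomorrow?"]
labels = [1, 0, 1, 0]                            # 1 = spam, 0 = ham

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LogisticRegression(max_iter=1000),
)
clf.fit(docs, labels)
print(clf.predict(["free prize pills"]))
```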
TF-IDF is a natural method for keyword extraction. To find the most distinctive terms in a document, compute the TF-IDF weights of all terms relative to a corpus and return the top-$k$ by weight. This works well when the corpus is large and topically diverse, because the IDF factor automatically downweights vocabulary that is generic across topics. The same idea underlies tag suggestion, search result snippets, and the construction of word clouds. For multi-document summarization, TF-IDF can score sentences by summing the weights of their constituent terms, then select the top-scoring sentences subject to a length budget and a redundancy penalty. A related idea underlies the graph-based extractive summarizers LexRank and TextRank, which rank sentences by centrality in a sentence-similarity graph (LexRank builds that graph from TF-IDF cosine similarities) and which were standard before transformer-based summarizers took over. TF-IDF features are also used to seed clustering algorithms like k-means for topic discovery, where each cluster centroid can be interpreted as a synthetic document whose top TF-IDF terms describe the cluster's theme.
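A sketch of keyword extraction along these lines: fit the vectorizer on the whole corpus, then return the top-k weighted terms of a single document. The helper name top_keywords is invented for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_keywords(corpus, doc_index, k=5):
    """Return the k terms with the highest TF-IDF weight in one document."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(corpus)
    terms = vectorizer.get_feature_names_out()
    row = X[doc_index].toarray().ravel()
    top = row.argsort()[::-1][:k]
    return [(terms[i], row[i]) for i in top if row[i] > 0]
```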
Dense vector representations from neural models such as BERT, Sentence-BERT, and OpenAI's text embedding models have largely supplanted TF-IDF for tasks that require true semantic search. Dense embeddings can match a query about automobile to a document about cars, recognize paraphrases, and bridge across languages, none of which TF-IDF can do because it treats car and automobile as unrelated atomic tokens. The vector space model has also evolved: instead of millions of sparse axes corresponding to vocabulary, dense models map text to a few hundred dense dimensions that encode latent semantic features. See word embeddings for the foundational neural representations of text.
For exact-keyword tasks, however, TF-IDF and its successor BM25 remain dominant. A search for a serial number, error code, drug name, gene symbol, or rare proper noun is best served by a system that knows that the exact spelling matters and that rarity in the corpus is informative, both of which TF-IDF captures naturally. This is why modern retrieval systems increasingly use hybrid retrieval, combining a sparse lexical retriever (TF-IDF or BM25) with a dense neural retriever and fusing the rankings, often via Reciprocal Rank Fusion (RRF). The lexical channel anchors rare or out-of-vocabulary terms that the neural channel may miss, while the neural channel handles synonymy, paraphrase, and contextual nuance. Studies of retrieval-augmented generation systems have repeatedly found that hybrid retrieval outperforms either approach in isolation, especially on heterogeneous corpora and ambiguous queries.
| Aspect | TF-IDF | Dense embeddings |
|---|---|---|
| Representation | Sparse, vocabulary-aligned | Dense, learned |
| Dimensions | Thousands to millions | Hundreds to thousands |
| Synonymy and paraphrase | Cannot capture | Strong |
| Exact match on rare terms | Strong | Often weaker |
| Out-of-domain robustness | Strong (no training required) | Variable, depends on training data |
| Compute cost at index time | Cheap, deterministic | Requires GPU inference |
| Compute cost at query time | Inverted index lookup | Vector search (HNSW, IVF, etc.) |
| Interpretability | High (terms have weights) | Low (latent dimensions) |
| Multilinguality | Per-language vocabularies | Cross-lingual models possible |
| Best use case | Exact keyword retrieval, classification baselines | Semantic search, cross-lingual retrieval |
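Reciprocal Rank Fusion, mentioned above as the usual fusion step in hybrid retrieval, takes only a few lines. A minimal sketch follows; the function name is invented for this example, the two ranked lists are assumed to come from whatever lexical and dense retrievers are in use, and k=60 is the constant commonly used in the RRF literature.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["d3", "d1", "d7"]      # e.g. BM25 results
dense = ["d1", "d9", "d3"]        # e.g. embedding search results
print(reciprocal_rank_fusion([lexical, dense]))
```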
In machine learning pipelines that operate on tabular data with text fields, TF-IDF features are often used alongside numeric features in tree-based models like XGBoost or LightGBM. Because TF-IDF features are sparse and interpretable, they fit naturally into feature engineering workflows where the data scientist wants to understand which words drive predictions.
TF-IDF has several well known limitations that motivated the move to dense representations and to BM25. The bag-of-words assumption discards word order and grammatical structure. Sentences like the dog bit the man and the man bit the dog produce identical TF-IDF vectors, even though their meanings are reversed. Negation, modifiers, and compound noun phrases all suffer from this. Adding bigrams or trigrams partially addresses the problem but increases vocabulary size and sparsity.
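The word-order blindness is easy to demonstrate: under a unigram model, the two sentences produce identical TF-IDF vectors. A minimal check:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the dog bit the man", "the man bit the dog"]
X = TfidfVectorizer().fit_transform(docs)
print((X[0] != X[1]).nnz == 0)    # True: the unigram vectors are identical
```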
TF-IDF has no notion of semantics. Two documents that discuss the same topic in entirely different vocabularies receive a near-zero similarity score, even though a human reader would recognize them as related. Synonyms, related concepts, hypernyms, and paraphrases are invisible to the algorithm. Latent semantic indexing (LSI), latent Dirichlet allocation (LDA), and modern embeddings were developed in part to solve this problem. Length normalization in basic TF-IDF is crude unless cosine or pivoted normalization is used. Raw TF-IDF scores grow with document length, so long documents tend to dominate retrieval results. BM25's smoothed length normalization handles this more gracefully.
TF-IDF assumes a fixed corpus. Adding new documents shifts the IDF values of every term, which can be inconvenient in streaming or dynamic settings. Approximate methods like hashed TF-IDF or incremental IDF updates exist but introduce their own trade-offs. Finally, TF-IDF is an unsupervised heuristic. It does not learn from labeled data, so it cannot exploit signal about which terms matter for a downstream task. Supervised feature weighting schemes such as supervised TF-IDF, delta-IDF, and learned sparse retrievers like SPLADE attempt to combine the interpretability and exact-match strengths of TF-IDF with task-specific learning.
Researchers have proposed many extensions to the basic TF-IDF formula, motivated by specific deficiencies or specific tasks. Delta TF-IDF weights terms by the difference between their IDF in two corpora, capturing the contrast between, say, a positive-sentiment corpus and a negative-sentiment corpus. It is useful for sentiment analysis and other contrastive tasks. Okapi BM25, discussed above, replaces linear TF with a saturating function and adds principled length normalization. BM25F extends BM25 to weighted document fields, useful when documents have structure (title, body, anchor text) and different fields should receive different weights. TF-PDF (Term Frequency-Proportional Document Frequency) is a variant used in topic detection and tracking, designed for streaming text where IDF is unstable.
Pivoted unique normalization corrects a length bias in cosine-normalized TF-IDF and was shown by Singhal, Buckley, and Mitra (1996) to improve retrieval effectiveness on long documents. Supervised TF-IDF and gain-weighted TF-IDF replace the IDF factor with a discriminative measure such as information gain, chi-squared statistic, or odds ratio, computed against labeled training data. Hashing TF-IDF (sometimes called feature hashing or the hashing trick) replaces the explicit vocabulary with a fixed-size hash table, trading a small amount of accuracy for the ability to process truly enormous corpora without storing a vocabulary. Learned sparse retrievers like SPLADE and uniCOIL use neural networks to predict sparse vectors with TF-IDF-like structure, combining the inverted index efficiency of TF-IDF with the semantic generalization of neural embeddings.
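The hashing trick is available in scikit-learn as HashingVectorizer, which can be paired with TfidfTransformer when IDF weighting is still wanted. A sketch, with toy documents as placeholders:

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

# No vocabulary is stored: terms are hashed into 2**20 buckets.
# alternate_sign=False keeps the hashed counts non-negative.
hasher = HashingVectorizer(n_features=2**20, alternate_sign=False)
tfidf = TfidfTransformer()

docs = ["streaming text arrives continuously", "no vocabulary needs to be kept in memory"]
X = tfidf.fit_transform(hasher.transform(docs))
print(X.shape)        # (2, 1048576), stored sparsely
```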
The key milestones in TF-IDF's development span more than half a century. In 1957, Hans Peter Luhn at IBM proposed using term frequency to identify keywords for automatic abstracting, in his paper A Statistical Approach to Mechanized Encoding and Searching of Literary Information. Luhn observed that the resolving power of significant words is highest at intermediate frequencies, neither very common nor very rare, an early hint of TF-IDF's combined logic.
In 1972, Karen Sparck Jones, working at the Cambridge Language Research Unit, published A Statistical Interpretation of Term Specificity and Its Application in Retrieval in the Journal of Documentation. The paper proposed weighting terms by their inverse document frequency and provided empirical evidence that this weighting improved retrieval. Sparck Jones did not call her quantity IDF in that paper, but the formula and its motivation are exactly what is now meant by IDF. Her work was foundational, and IDF is sometimes called the Sparck Jones weight.
Through the 1970s, Gerard Salton and his collaborators at Cornell developed the SMART information retrieval system, which combined TF and IDF weights with the vector space model and cosine similarity. The 1975 paper by Salton, Wong, and Yang, A Vector Space Model for Automatic Indexing, formalized the framework that is still taught in IR courses today. In 1988, Salton and Buckley published Term-weighting approaches in automatic text retrieval in Information Processing and Management. Drawing on years of experiments with the SMART system, the paper catalogued the variants of TF and IDF, introduced the SMART notation, and argued for specific combinations as best practice. It remains one of the most cited papers in information retrieval and is the source for many of the variant formulas listed above.
The Robertson and Sparck Jones probabilistic relevance model, developed through the 1970s and 1980s, generalized IDF into the probabilistic framework that eventually produced BM25. Stephen Robertson and Steve Walker's 1994 TREC papers introduced the BM25 formula explicitly, and by the 2000s BM25 had become the de facto improvement over basic TF-IDF. In the late 2000s and 2010s, the rise of word embeddings (word2vec in 2013, GloVe in 2014) and contextual embeddings (ELMo in 2018, BERT in 2018) shifted research attention from sparse weighted vectors to dense neural representations. By the early 2020s, the dominant approach to semantic retrieval was dense vector search using transformer-based encoders. TF-IDF and BM25 receded to the role of strong baselines and as components of hybrid retrieval systems.
Despite the dominance of dense embeddings in research, TF-IDF remains ubiquitous in production. The reasons are practical. TF-IDF is fast, deterministic, and requires no training data or GPU. Its memory footprint is small enough that a vocabulary of millions of terms fits comfortably in RAM, and inverted indexes deliver sub-millisecond query latency on modern hardware. Its scores are interpretable, which is essential in regulated industries like healthcare, law, and finance, where ranking decisions may need to be explained to auditors. It also works out of the box on any corpus in any language without fine-tuning.
A non-exhaustive list of present-day TF-IDF use cases includes the following. Internal enterprise search systems often use TF-IDF or BM25 as the primary ranker, sometimes with a neural reranker on the top results. Log search engines like Splunk and Elastic are dominated by lexical scoring because exact matches on error messages and identifiers are paramount. Code search inside IDEs frequently relies on TF-IDF over symbol names because semantic models struggle with rare identifiers and short queries. Document deduplication uses TF-IDF cosine similarity as a fast first-stage filter before more expensive comparisons. Plagiarism detection compares TF-IDF vectors to find suspiciously similar passages. Tag and category suggestion in CMS systems pulls top TF-IDF terms as suggested tags. Spam and content moderation pipelines use TF-IDF features inside lightweight linear classifiers as the first line of defense. The text fields of tabular data in Kaggle competitions are routinely converted to TF-IDF vectors and concatenated with numeric features for tree-based models. Educational software uses TF-IDF for short-answer scoring and for matching student responses to model answers.
For a working machine learning practitioner in 2026, the right mental model is straightforward. TF-IDF is a strong, simple, interpretable baseline that should always be tried first. If it solves the problem, the additional cost and complexity of a neural model may not be justified. If it does not, dense embeddings or a hybrid system are the next step. Even when a neural system is the final choice, TF-IDF often makes sense as a feature, a fast pre-filter, or a debugging tool for understanding why the neural model produces particular results.