# Bag of Words

> Source: https://aiwiki.ai/wiki/bag_of_words
> Updated: 2026-07-13
> Categories: Machine Learning, Natural Language Processing
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

The **bag of words** (BoW) model is a text representation method that converts a document into a fixed-length numeric vector by counting how often each word from a predefined vocabulary appears, while discarding grammar and word order. It is one of the oldest and most widely used ways to turn text into numbers for [machine learning](/wiki/machine_learning) and [natural language processing](/wiki/natural_language_processing), and it remains a default baseline in 2026 for classical [information retrieval](/wiki/information_retrieval), [naive Bayes](/wiki/naive_bayes) spam classifiers, and feature pipelines that feed [topic modeling](/wiki/topic_modeling) algorithms. The model takes its name from a 1954 observation by linguist Zellig Harris that "language is not merely a bag of words."[1]

In its most general statement, BoW commits to a fixed vocabulary $$V$$ of size $$\lvert V \rvert$$ and maps every document $$d$$ to a vector $$x \in \mathbb{R}^{\lvert V \rvert}$$, where component $$x_i$$ records how often word $$i$$ occurs in $$d$$. The exact mapping depends on the weighting scheme: raw counts, binary presence flags, normalized counts, [TF-IDF](/wiki/tf_idf) weights, or [BM25](/wiki/bm25) scores. Whatever syntax, ordering, and discourse structure the original text contained gets thrown out, leaving a single vector that summarizes which words appeared and how often. The surprise of BoW is how often that crude summary turns out to be enough.

## When was the bag of words model invented?

The conceptual roots of the bag of words model trace back to linguist Zellig Harris's 1954 article "Distributional Structure," which explored the idea that the distribution of words in context carries meaningful information about language. Harris notably observed that "language is not merely a bag of words but a tool with particular properties which have been fashioned in the course of its use," an ironic early usage of the phrase that would later lend its name to the model.[1] Harris's broader distributional hypothesis, which states that words occurring in similar contexts tend to have similar meanings, laid the theoretical groundwork for many statistical approaches to language, including BoW.

The explicit formalization of BoW as a numerical representation came from work in [information retrieval](/wiki/information_retrieval) during the 1960s and 1970s. Gerard Salton and his collaborators at Cornell developed the SMART Information Retrieval System, which represented documents and queries as numeric vectors over a vocabulary. Their 1975 paper "A Vector Space Model for Automatic Indexing," published in Communications of the ACM (volume 18, pages 613-620), set out the geometry that underlies modern retrieval: documents are points in a high-dimensional space, queries are also points, and similarity reduces to the cosine of the angle between two vectors.[2] The vocabulary defines the axes and BoW counts give the coordinates.

A key refinement followed in 1988 when Salton and Christopher Buckley published "Term-weighting approaches in automatic text retrieval," cataloging the family of TF-IDF style weights that would dominate text retrieval for the next quarter century.[3] Karen Sparck Jones had already published her 1972 paper introducing inverse document frequency as a way to scale term importance based on rarity.[4] The Okapi BM25 weighting function, developed by Stephen Robertson and his collaborators at City University in London during the early 1990s, became the standard probabilistic refinement of TF-IDF and is still the default ranker in many open-source search engines today.[5]

The bag of words model gained practical traction outside retrieval in the late 1990s and early 2000s as researchers applied it to text classification and spam filtering. The early Naive Bayes spam filter of Sahami, Dumais, Heckerman, and Horvitz (1998) and the Bayesian scoring component of the otherwise rule-based SpamAssassin both treated email as a bag of words and scored each message by the conditional probability of its tokens under spam vs. ham distributions.[6] Its straightforward implementation and reasonable performance on many tasks made BoW a default baseline for text analysis long before the rise of [deep learning](/wiki/deep_learning) and [word embedding](/wiki/word_embedding) methods.

## How does the bag of words model work?

The bag of words pipeline involves three main stages: tokenization, vocabulary construction, and vectorization. In production there is usually a fourth stage of vocabulary trimming and a fifth weighting step that converts raw counts to TF-IDF or BM25 scores before the vector reaches a classifier.

### Tokenization

The first step is breaking raw text into individual units called [token](/wiki/token)s. In the simplest case, tokens are individual words separated by whitespace. More sophisticated tokenizers handle punctuation, contractions, and special characters. Common preprocessing steps applied during or after tokenization include:

- **Lowercasing** all text so that "Apple" and "apple" are treated as the same token
- **Stopword removal**, which filters out extremely common words like "the," "is," "and," and "a" that carry little discriminative meaning
- **[Stemming](/wiki/stemming)** or **[lemmatization](/wiki/lemmatization)**, which reduces words to a common base form (for example, "running" and "runs" both map to "run"; mapping the irregular form "ran" to "run" generally requires lemmatization)

Tokenization choices have a surprisingly large effect on downstream model quality. A whitespace tokenizer that ignores punctuation will treat "don't" as a single token, while a regex tokenizer using `\w+` will split it into "don" and "t." What matters is consistency between training and inference. Stemming reduces "organize," "organizing," and "organization" to the same stem (typically "organ" under the Porter stemmer), shrinking the vocabulary at the cost of some semantic precision. Lemmatization is more linguistically faithful: it normalizes inflectional forms while preserving stems, mapping "better" to "good" and "organized" to "organize." The [Porter stemmer](/wiki/porter_stemmer), the Snowball stemmer, and the WordNet lemmatizer (via NLTK) cover most English use cases.
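
The difference between the two tokenizers is easy to see in a few lines. The following is a minimal sketch using only the standard library; the function names and the five-word stopword list are illustrative assumptions, not part of any particular toolkit.

```python
import re

# Illustrative stopword list; real lists (NLTK, scikit-learn) are much longer.
STOPWORDS = {"the", "is", "and", "a", "on"}

def whitespace_tokenize(text):
    """Split on whitespace only, so punctuation stays attached to words."""
    return text.split()

def regex_tokenize(text):
    """Extract runs of word characters, so an apostrophe splits a token."""
    return re.findall(r"\w+", text)

def preprocess(text):
    """Lowercase, regex-tokenize, and drop stopwords."""
    return [tok for tok in regex_tokenize(text.lower()) if tok not in STOPWORDS]

print(whitespace_tokenize("I don't know"))   # ['I', "don't", 'know']
print(regex_tokenize("I don't know"))        # ['I', 'don', 't', 'know']
print(preprocess("The cat sat on the mat"))  # ['cat', 'sat', 'mat']
```

Whichever tokenizer is chosen, the same function should be applied at training and inference time so that both sides see the same token inventory.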

### Vocabulary construction

After tokenization, the model builds a vocabulary: a list of all unique tokens found across the entire document collection (corpus). Each unique token is assigned an index position. For example, given two sentences:

- Sentence A: "The cat sat on the mat"
- Sentence B: "The dog sat on the log"

After lowercasing and removing the stopwords "the" and "on," the vocabulary might be: `[cat, sat, mat, dog, log]`.
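
As a rough sketch of this step, the vocabulary can be built by assigning each new token the next free index in order of first appearance. The helper name `to_vector` and the whitespace tokenizer are assumptions made for illustration.

```python
STOPWORDS = {"the", "on"}  # only the two stopwords removed in the example

sentences = ["The cat sat on the mat", "The dog sat on the log"]

# token -> index, assigned in order of first appearance
vocabulary = {}
for sentence in sentences:
    for token in sentence.lower().split():
        if token not in STOPWORDS and token not in vocabulary:
            vocabulary[token] = len(vocabulary)

def to_vector(sentence):
    """Count how often each vocabulary token occurs in the sentence."""
    vector = [0] * len(vocabulary)
    for token in sentence.lower().split():
        if token in vocabulary:
            vector[vocabulary[token]] += 1
    return vector

print(list(vocabulary))         # ['cat', 'sat', 'mat', 'dog', 'log']
print(to_vector(sentences[0]))  # [1, 1, 1, 0, 0]
print(to_vector(sentences[1]))  # [0, 1, 0, 1, 1]
```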

A real corpus yields a far larger vocabulary than five tokens. The 20 Newsgroups dataset produces a vocabulary of about 100,000 distinct word types after light cleaning. The English Wikipedia dump contains tens of millions of distinct tokens, most of them rare. The trick most pipelines use is **vocabulary trimming**: drop any token that appears in fewer than `min_df` documents (often 2 or 5) and any token that appears in more than a `max_df` fraction of documents (often 0.9 or 0.95). This removes typos, OCR errors, and stopword-like tokens without a custom stopword list.
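
A minimal sketch of document-frequency trimming, assuming `min_df` is an absolute document count and `max_df` is a fraction of the corpus. The function name `trim_vocabulary` and the toy corpus are hypothetical.

```python
from collections import Counter

def trim_vocabulary(tokenized_docs, min_df=2, max_df=0.9):
    """Keep tokens found in at least min_df documents and at most a max_df fraction of them."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each token once per document
    return sorted(
        tok for tok, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    )

docs = [
    ["learning", "models", "text"],
    ["learning", "models", "teh"],       # "teh" is a typo seen once
    ["learning", "text", "retrieval"],
    ["learning", "retrieval", "index"],
]
print(trim_vocabulary(docs, min_df=2, max_df=0.9))
# ['models', 'retrieval', 'text']
```

Here "teh" and "index" fall below `min_df`, while "learning" appears in every document and is removed by `max_df`, much as a stopword would be.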

A related technique called **the hashing trick** sidesteps the vocabulary entirely. The system applies a fast hash function that maps every token to one of a fixed number of buckets (say 2^18 or 2^20). The output vector has that fixed size regardless of corpus, with the trade-off that distinct tokens can collide into the same bucket. Scikit-learn's `HashingVectorizer` implements this idea and is the standard choice for streaming or out-of-core text classification.
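
The idea can be sketched without any library support. The version below uses CRC32 from the standard library as a stand-in hash, which is an assumption for illustration only; production implementations typically use a different hash function, and `hashed_bow` is a hypothetical name.

```python
import zlib

def hashed_bow(tokens, n_buckets=2**18):
    """Map tokens to bucket counts with a stable hash; no vocabulary is stored."""
    vector = {}  # sparse representation: bucket index -> count
    for token in tokens:
        bucket = zlib.crc32(token.encode("utf-8")) % n_buckets
        vector[bucket] = vector.get(bucket, 0) + 1
    return vector

vec = hashed_bow("the cat sat on the mat".split())
print(sum(vec.values()))  # 6: every token lands in some bucket
```

Because the bucket index depends only on the token string, a document seen next year is vectorized exactly like one seen today, with no fitting step.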

### Vectorization

Each document is then represented as a numerical vector whose length equals the size of the vocabulary. The value at each position depends on the weighting scheme used. The collection of all such document vectors stacked into a matrix is called the **document-term matrix**, sometimes abbreviated DTM. With $$N$$ documents and a vocabulary of size $$\lvert V \rvert$$, the DTM is an $$N$$ by $$\lvert V \rvert$$ matrix, almost always stored sparsely because the typical document only touches a few hundred of the tens of thousands of vocabulary terms.

## What weighting schemes does bag of words use?

Several approaches exist for assigning values to the vector positions. The choice of scheme can significantly affect downstream model performance.

| Scheme | Description | Value at Position $$i$$ | Best For |
|---|---|---|---|
| Count (frequency) | Counts how many times each word appears in the document | Number of occurrences of word $$i$$ | General text classification |
| Binary | Records only whether a word is present or absent | 1 if word $$i$$ is present, 0 otherwise | Short documents, presence-based tasks |
| Log-frequency | Log-dampened raw counts | $$1 + \log(\text{count})$$ if $$\text{count} > 0$$, else 0 | Reducing the weight of repeated tokens |
| TF-IDF | Adjusts word counts by how common the word is across all documents | $$\text{TF}(i) \times \text{IDF}(i)$$ | Information retrieval, distinguishing important terms |
| Normalized frequency | Divides raw counts by the total number of words in the document | Count of word $$i$$ / total words | Comparing documents of different lengths |
| BM25 | Probabilistic ranking score with length normalization | See BM25 formula below | Search ranking, robust to long documents |
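
The four document-local schemes in the table can be compared side by side. This is a small sketch with a hypothetical helper `weight_vectors`; TF-IDF and BM25 are omitted because they also need corpus-level statistics.

```python
import math
from collections import Counter

def weight_vectors(tokens, vocabulary):
    """Return count, binary, log-frequency, and normalized vectors for one document."""
    counts = Counter(tokens)
    total = len(tokens)  # total words in the document
    raw = [counts[w] for w in vocabulary]
    return {
        "count": raw,
        "binary": [1 if c > 0 else 0 for c in raw],
        "log": [1 + math.log(c) if c > 0 else 0.0 for c in raw],
        "normalized": [c / total for c in raw],
    }

vocab = ["deep", "learning", "love"]
schemes = weight_vectors("deep learning deep learning deep".split(), vocab)
print(schemes["count"])       # [3, 2, 0]
print(schemes["binary"])      # [1, 1, 0]
print(schemes["normalized"])  # [0.6, 0.4, 0.0]
```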

### Count vectorization

The most basic form of BoW uses raw word counts. If the word "learning" appears three times in a document, the corresponding vector position has a value of 3. This approach is intuitive but can give disproportionate weight to frequent but uninformative words.

### Binary vectorization

Binary BoW simply marks whether a word appears (1) or does not appear (0) in a document, ignoring frequency entirely. This works well for short texts where word repetition is rare, such as tweets or product reviews. Binary BoW is also the natural input for **multivariate Bernoulli naive Bayes**, where each document is modeled as a draw of $$\lvert V \rvert$$ independent Bernoulli variables that fire only on word presence.

### TF-IDF weighting

Term Frequency-Inverse Document Frequency (TF-IDF) is the most popular extension of the basic BoW model. It addresses a key weakness of raw counts: common words like "the" or "is" appear frequently everywhere and do not help distinguish documents from one another. TF-IDF downweights these common terms and upweights rare, distinctive ones.

The formula has two components:

- **Term Frequency (TF):** The number of times a term appears in a document, often normalized by the total number of terms in that document.
- **Inverse Document Frequency (IDF):** The logarithm of the total number of documents divided by the number of documents containing the term.

$$
\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log(N / \text{DF}(t))
$$

Where $$t$$ is the term, $$d$$ is the document, $$N$$ is the total number of documents, and $$\text{DF}(t)$$ is the number of documents containing term $$t$$.

A word that appears frequently in one document but rarely across the corpus receives a high TF-IDF score, signaling that it is particularly relevant to that document.[3]

There are several common variants of the IDF expression. The classical Sparck Jones IDF is $$\log(N / \text{DF}(t))$$. Scikit-learn uses a smoothed version, $$\log((1 + N) / (1 + \text{DF}(t))) + 1$$, which avoids division by zero for unseen terms. Many implementations apply L2 normalization so that document length cancels out under cosine similarity. Combined with cosine similarity, normalized TF-IDF is essentially the lingua franca of classical IR. See the dedicated [TF-IDF](/wiki/tf_idf) article for more detail.
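
To make the difference between the two IDF variants concrete, the sketch below evaluates both for a three-document corpus using natural logarithms. The function names are illustrative.

```python
import math

def idf_classic(n_docs, df):
    """Sparck Jones IDF: log(N / DF)."""
    return math.log(n_docs / df)

def idf_smoothed(n_docs, df):
    """Smoothed IDF: log((1 + N) / (1 + DF)) + 1."""
    return math.log((1 + n_docs) / (1 + df)) + 1

for df in (1, 2, 3):
    print(df, round(idf_classic(3, df), 4), round(idf_smoothed(3, df), 4))
# 1 1.0986 1.6931
# 2 0.4055 1.2877
# 3 0.0 1.0
```

Under the classical form, a term that occurs in every document gets weight zero and vanishes from the vector; the smoothed form keeps it with weight 1.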

### BM25 weighting

BM25, formally Okapi BM25, is a probabilistic relevance-ranking function introduced by Stephen Robertson and colleagues in their 1994 and 1995 TREC papers and refined into its standard form in the late 1990s.[5] Although BM25 is more often described as a ranking function than as a feature weighting, it can be viewed as a non-linear weighting scheme over the same BoW vocabulary as TF-IDF. The standard formula is:

$$
\text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t,d) \cdot (k_1 + 1)}{f(t,d) + k_1 \cdot \left(1 - b + b \cdot \frac{\lvert d \rvert}{\text{avgdl}}\right)}
$$

where $$f(t,d)$$ is the frequency of term $$t$$ in document $$d$$, $$\lvert d \rvert$$ is the length of $$d$$, $$\text{avgdl}$$ is the average document length, and the parameters $$k_1$$ (typically 1.2 to 2.0) and $$b$$ (typically 0.75) control term-frequency saturation and length normalization. BM25 saturates: doubling the count of a term does not double its score, preventing a stuffed keyword from dominating. BM25 also normalizes for document length. These two corrections explain why BM25 has been the workhorse of full-text search for nearly thirty years and remains the default similarity in [Lucene](/wiki/lucene), Elasticsearch, OpenSearch, and Solr. See the dedicated [BM25](/wiki/bm25) article for derivations and tuning.
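
The formula translates directly into code. The sketch below is a plain-Python illustration, not the implementation used by any search engine. The formula above leaves $$\text{IDF}(t)$$ unspecified, so the code assumes one common non-negative variant, $$\log(1 + (N - \text{DF}(t) + 0.5) / (\text{DF}(t) + 0.5))$$; the function name `bm25_scores` is hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.2, b=0.75):
    """Score every document against the query with the BM25 formula."""
    n_docs = len(tokenized_docs)
    avgdl = sum(len(d) for d in tokenized_docs) / n_docs
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        length_norm = 1 - b + b * len(doc) / avgdl
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue
            # Assumed IDF variant (always non-negative); other variants exist.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

docs = [
    "deep learning for search".split(),
    "deep deep deep deep".split(),
    "classical search engines".split(),
]
print(bm25_scores(["deep"], docs))
```

The second document repeats "deep" four times but scores less than twice as high as the first, which illustrates the saturation effect: the term-frequency factor can never exceed $$k_1 + 1$$.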

## Worked example

Consider three short documents:

- **Doc 1:** "I love machine learning"
- **Doc 2:** "I love deep learning"
- **Doc 3:** "deep learning is fascinating"

After lowercasing and removing stopwords ("I," "is"), the vocabulary is: `[love, machine, learning, deep, fascinating]`.

The count vectors are:

| Document | love | machine | learning | deep | fascinating |
|---|---|---|---|---|---|
| Doc 1 | 1 | 1 | 1 | 0 | 0 |
| Doc 2 | 1 | 0 | 1 | 1 | 0 |
| Doc 3 | 0 | 0 | 1 | 1 | 1 |

Notice that "learning" appears in all three documents, so it would receive a low IDF score under TF-IDF weighting. Meanwhile, "machine" and "fascinating" each appear in only one document, so they would receive high IDF scores and help distinguish those documents.

Under TF-IDF weighting (smoothed scikit-learn variant with L2 normalization), "learning" receives a smaller weight than "machine" or "fascinating" in their respective documents, because IDF tells the model that "learning" is uninformative within this corpus. Cosine similarity between Doc 1 and Doc 2 (which share "love" and "learning") is higher than between Doc 1 and Doc 3, matching human intuition that the first two documents are more similar.
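
These claims can be checked by hand. The sketch below applies the smoothed IDF and L2 normalization to the count table above in plain Python; it mirrors the weighting described in this section but is not a substitute for a library implementation.

```python
import math

vocab = ["love", "machine", "learning", "deep", "fascinating"]
counts = {
    "Doc 1": [1, 1, 1, 0, 0],
    "Doc 2": [1, 0, 1, 1, 0],
    "Doc 3": [0, 0, 1, 1, 1],
}
n_docs = len(counts)

# Document frequency per vocabulary column, then smoothed IDF.
df = [sum(1 for row in counts.values() if row[i] > 0) for i in range(len(vocab))]
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_l2(row):
    """Multiply counts by IDF, then scale the vector to unit length."""
    weighted = [c * w for c, w in zip(row, idf)]
    norm = math.sqrt(sum(x * x for x in weighted))
    return [x / norm for x in weighted]

vectors = {name: tfidf_l2(row) for name, row in counts.items()}

def cosine(u, v):
    # Vectors are already unit length, so the dot product is the cosine.
    return sum(a * b for a, b in zip(u, v))

sim_12 = cosine(vectors["Doc 1"], vectors["Doc 2"])
sim_13 = cosine(vectors["Doc 1"], vectors["Doc 3"])
print(round(sim_12, 3))  # 0.544
print(round(sim_13, 3))  # 0.181
```

Doc 1 and Doc 3 overlap only on "learning," the term with the lowest IDF, which is why their similarity is so much smaller.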

## What are n-gram extensions of bag of words?

A major weakness of the standard BoW model is that it treats each word independently, losing all information about word order. The [n-gram](/wiki/n-gram) extension addresses this by considering sequences of consecutive words as single tokens rather than individual words alone.

- A **unigram** model (n=1) is the standard bag of words
- A **bigram** model (n=2) includes two-word phrases like "machine learning" or "not good"
- A **trigram** model (n=3) captures three-word sequences like "natural language processing"

Using bigrams allows the model to partially capture word order. For instance, the phrases "not good" and "very good" become distinct [feature](/wiki/feature)s rather than being collapsed into the same set of individual words. Research on email spam classification has shown that trigram and 4-gram features can achieve classification accuracy above 98%.[7] In practice, combining unigrams with bigrams tends to offer the best balance between capturing local word order and keeping the vocabulary manageable.

| n-gram order | Example tokens for "natural language processing is fun" | Vocabulary size impact | Notes |
|---|---|---|---|
| Unigram (n=1) | natural, language, processing, is, fun | Baseline $$V$$ | Pure BoW; loses all word order |
| Bigram (n=2) | natural language, language processing, processing is, is fun | Up to $$V^2$$ in theory; in practice 5-15x $$V$$ | Captures common collocations like "not good" |
| Trigram (n=3) | natural language processing, language processing is, processing is fun | Even sparser; up to $$V^3$$ | Useful for short genre-specific corpora |
| Char 3-gram | nat, atu, tur, ura, ral, ... | Bounded by $$\text{alphabet}^3$$ | Robust to typos; used in language ID and spam filtering |
| Char 4-gram | natu, atur, tura, ural, ... | Bounded by $$\text{alphabet}^4$$ | Falls within the 3-6 character range that fastText subword models use by default |

Character n-grams are a cousin of word n-grams. They produce smaller and bounded vocabularies (limited by the alphabet) and tend to be robust to typographic noise, which is why early spam filters and forensic authorship attribution systems leaned on character 3-grams and 4-grams.[7] [fastText](/wiki/fasttext) reuses the same idea in a learned-embedding setting.
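
Both kinds of n-gram come from the same sliding-window idea, as the short sketch below shows. The helper names are illustrative, and the character version ignores word-boundary markers for simplicity.

```python
def word_ngrams(tokens, n):
    """Slide a window of n tokens across the list and join each window."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(word, n):
    """Slide a window of n characters across a single word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

tokens = "natural language processing is fun".split()
print(word_ngrams(tokens, 2))
# ['natural language', 'language processing', 'processing is', 'is fun']
print(char_ngrams("natural", 3))
# ['nat', 'atu', 'tur', 'ura', 'ral']
```

The resulting n-grams are then counted exactly like unigrams, so the rest of the BoW pipeline is unchanged.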

## How do you implement bag of words in Python?

The scikit-learn library provides two main classes for bag of words: `CountVectorizer` for raw counts and `TfidfVectorizer` for TF-IDF weighted vectors. Both follow the standard fit-transform pattern.

### CountVectorizer

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love machine learning",
    "I love deep learning",
    "deep learning is fascinating"
]

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
# ['deep' 'fascinating' 'learning' 'love' 'machine']

# Columns follow the alphabetical feature order printed above.
print(X.toarray())
# [[0 0 1 1 1]
#  [1 0 1 1 0]
#  [1 1 1 0 0]]
```

### TfidfVectorizer

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Reuses `corpus` from the CountVectorizer example above.
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)

print(X.toarray())
# Each value is now a TF-IDF weighted float
# instead of a raw integer count
```

Both classes support n-gram ranges through the `ngram_range` parameter. Setting `ngram_range=(1, 2)` includes both unigrams and bigrams.[8]

Scikit-learn internally stores BoW matrices as sparse matrices (using `scipy.sparse.csr_matrix`), which is essential for handling the high-dimensional, mostly-zero vectors that BoW produces.[9]

### Using the hashing trick for streaming data

```python
from sklearn.feature_extraction.text import HashingVectorizer

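# Reuses `corpus` from the CountVectorizer example above.
# alternate_sign=False keeps every feature value non-negative.
hasher = HashingVectorizer(n_features=2**18, alternate_sign=False)
# No fit step is needed because the hash function is fixed.
X = hasher.transform(corpus)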
```

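`HashingVectorizer` does not maintain a vocabulary, so it can be applied to a stream of new documents without coordinating a global token table. The trade-off is collisions. With 2^18 (262,144) buckets and 100,000 unique tokens, uniform hashing leaves roughly a third of tokens sharing a bucket with at least one other token; the accuracy cost is often small in practice, and raising `n_features` to 2^20 cuts that share to about 9%.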
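
The collision trade-off can be estimated with a short calculation. The sketch below assumes an idealized uniform hash, under which a given token shares its bucket with at least one of the other tokens with probability $$1 - (1 - 1/m)^{n-1}$$ for $$n$$ tokens and $$m$$ buckets. The function name is hypothetical.

```python
def expected_collision_fraction(n_tokens, n_buckets):
    """Expected share of tokens that share a bucket with another token (uniform hashing)."""
    return 1 - (1 - 1 / n_buckets) ** (n_tokens - 1)

for exponent in (18, 20, 22):
    frac = expected_collision_fraction(100_000, 2 ** exponent)
    print(f"2^{exponent}: {frac:.1%}")
# 2^18: 31.7%
# 2^20: 9.1%
# 2^22: 2.4%
```

This counts every colliding token equally. Since most vocabulary entries are rare, the share of token occurrences affected in a real corpus depends on which tokens happen to collide.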

### A full text-classification pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import cross_val_score

train = fetch_20newsgroups(subset='train', categories=['sci.med', 'sci.space'])

pipe = Pipeline([
    ('vec', TfidfVectorizer(
        stop_words='english',
        ngram_range=(1, 2),
        min_df=2,
        max_df=0.9,
        sublinear_tf=True,
    )),
    ('clf', LogisticRegression(max_iter=1000, C=1.0)),
])

scores = cross_val_score(pipe, train.data, train.target, cv=5)
print(scores.mean())
```

A TF-IDF plus logistic regression pipeline like this one routinely scores in the high 90s on the binary medicine vs. space split of 20 Newsgroups, which is one of the reasons BoW remains the sanity-check baseline for any new neural text classifier.

### gensim and topic models

The gensim library uses a slightly different idiom in which a `Dictionary` builds the vocabulary and corpus iterators yield `(token_id, count)` tuples rather than dense vectors. This streaming-friendly format makes gensim a natural choice for fitting large topic models such as [LDA](/wiki/latent_dirichlet_allocation).

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

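# Reuses `corpus` from the CountVectorizer example above; a whitespace split stands in for real tokenization.
texts = [doc.lower().split() for doc in corpus]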

dictionary = Dictionary(texts)
bow_corpus = [dictionary.doc2bow(text) for text in texts]

lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=2, passes=10)
```

LDA only operates on integer counts, so the bag of words representation is mandatory. This dependence is one of the reasons BoW continues to ship in modern data-science stacks even when the rest of the pipeline uses dense embeddings.

## Classical NLP pipeline with BoW

A typical end-to-end BoW pipeline for text classification has the following stages:

| Stage | Tools | Purpose | Comment |
|---|---|---|---|
| Cleanup | regex, BeautifulSoup | Strip HTML, lowercase, remove URLs and emails | Critical for web-scraped text |
| Tokenization | regex, NLTK, spaCy | Split text into tokens | Choice between word, subword, or character tokens |
| Stopword removal | NLTK, sklearn lists | Remove uninformative high-frequency tokens | Sometimes hurts; topic models often want stopwords |
| Normalization | Porter, Snowball, WordNet | Stemming or lemmatization | Stemming faster, lemmatization more precise |
| Vocabulary trimming | min_df / max_df | Drop rare typos and ultra-frequent terms | Often more impactful than stopword removal |
| Vectorization | CountVectorizer, TfidfVectorizer | Build sparse DTM | Use n-gram ranges of (1, 2) as a default |
| Weighting | TF-IDF, BM25 | Down-weight common, up-weight discriminative | Sublinear TF often helps |
| Modeling | naive Bayes, logistic regression, SVM | Fit a classifier or ranker | Linear models scale well to high-dim sparse input |
| Evaluation | accuracy, F1, MAP, NDCG | Score predictions | Pick a metric matched to the task |

The pipeline is deliberately modular: each step has a well-tested implementation in [scikit-learn](/wiki/scikit_learn), spaCy, or NLTK. When a transformer fails on a niche corpus, an engineer can usually stand up a TF-IDF baseline in an afternoon, ship it, and revisit the deep-learning option later.

## What is the bag of words model used for?

Despite its simplicity, the bag of words model has proven effective across a wide range of tasks.

| Application | How BoW Is Used | Typical Classifiers |
|---|---|---|
| [Sentiment analysis](/wiki/sentiment_analysis) | Documents are vectorized and classified as positive, negative, or neutral based on word frequencies | [Naive Bayes](/wiki/naive_bayes), [logistic regression](/wiki/logistic_regression), [SVM](/wiki/support_vector_machine_svm) |
| Spam detection | Emails are converted to BoW vectors; spam-indicative words receive high weights | Naive Bayes, [random forest](/wiki/random_forest) |
| Topic modeling | BoW matrices serve as input to algorithms that discover latent topics | [Latent Dirichlet Allocation](/wiki/latent_dirichlet_allocation) (LDA), Non-negative Matrix Factorization (NMF) |
| Document classification | BoW vectors enable assigning documents to predefined categories | SVM, logistic regression, [decision tree](/wiki/decision_tree) |
| Information retrieval | TF-IDF weighted BoW vectors are used to rank documents by relevance to a search query | Cosine similarity ranking, [BM25](/wiki/bm25) |
| Authorship attribution | Stylometric features based on word frequencies help identify the author of a text | SVM, [neural network](/wiki/neural_network) |
| Plagiarism detection | Pairwise cosine similarity over BoW vectors flags near-duplicate documents | Threshold cosine, MinHash |
| Document clustering | Sparse TF-IDF vectors are clustered with k-means or spectral methods | k-means, agglomerative clustering |
| News deduplication | Compact BoW signatures identify near-duplicate stories in real-time feeds | LSH over TF-IDF |
| Patent prior-art search | TF-IDF vectors over patent texts retrieve related patents and publications | BM25 |
| Legal e-discovery | TF-IDF and Boolean BoW retrieval surface relevant documents during litigation | Logistic regression on TF-IDF |

### Sentiment analysis at scale

The textbook example of BoW + Naive Bayes for sentiment analysis runs on the IMDb movie review dataset, a collection of 50,000 polarity-labeled reviews assembled by Maas et al. for a 2011 ACL paper and split into 25,000 training and 25,000 test reviews. Reviews were labeled positive only when the source rating was at least 7 out of 10 and negative only when it was at most 4 out of 10.[10] A unigram TF-IDF classifier with logistic regression typically scores about 88% accuracy on that benchmark. That is well below modern transformer scores in the mid-90s, but remarkable for a bag-of-counts model that fits in a few hundred megabytes of RAM and trains in seconds. For sentiment analysis on very long documents (legal filings, financial reports), BoW often outperforms small transformer fine-tunes because document-level sentiment is essentially a frequency phenomenon.

### Spam filtering and email triage

Naive Bayes spam filters trace their lineage to Sahami, Dumais, Heckerman, and Horvitz at Microsoft Research in 1998 and to Paul Graham's 2002 essay "A Plan for Spam."[11] Both approaches treat email as a bag of words, compute the conditional probability of each token under the spam and ham classes, and combine the per-token probabilities into a single message-level score using Bayes' rule with an independence assumption. SpamAssassin, the de-facto open-source filter on UNIX mail servers since the early 2000s, uses a hybrid of hand-written rules and BoW Naive Bayes scoring. BoW Naive Bayes remains a live component of large-scale spam pipelines because it is fast, interpretable, and easy to update online with new tokens.
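
The scoring rule described above can be sketched in a few lines. The version below is a toy multinomial naive Bayes with add-one smoothing, which is an assumption for illustration; it is not the exact method of any filter named here, and the function names and four-message training set are hypothetical.

```python
import math
from collections import Counter

def train_nb(labeled_docs):
    """Collect per-class token counts and class counts from (tokens, label) pairs."""
    token_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for tokens, label in labeled_docs:
        token_counts[label].update(tokens)
        class_counts[label] += 1
    return token_counts, class_counts

def spam_log_odds(tokens, token_counts, class_counts, alpha=1.0):
    """log P(spam | tokens) - log P(ham | tokens), assuming token independence."""
    vocab = set(token_counts["spam"]) | set(token_counts["ham"])
    log_odds = math.log(class_counts["spam"] / class_counts["ham"])
    for label, sign in (("spam", 1), ("ham", -1)):
        total = sum(token_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # tokens never seen in training are ignored
            p = (token_counts[label][tok] + alpha) / (total + alpha * len(vocab))
            log_odds += sign * math.log(p)
    return log_odds

training = [
    ("win free money now".split(), "spam"),
    ("free prize claim now".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("lunch on monday".split(), "ham"),
]
model = train_nb(training)
print(spam_log_odds("free money".split(), *model) > 0)      # True
print(spam_log_odds("monday meeting".split(), *model) > 0)  # False
```

Updating the model online only requires incrementing the counters, which is part of why this approach is easy to keep current.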

### Topic modeling with LDA

David Blei, Andrew Ng, and Michael Jordan introduced [Latent Dirichlet Allocation](/wiki/latent_dirichlet_allocation) (LDA) in their 2003 JMLR paper.[12] LDA assumes each document is a mixture of latent topics, and each topic is a probability distribution over words. The input is a BoW count matrix; the output is two factorized matrices, one mapping documents to topic mixtures and the other mapping topics to word distributions. LDA's likelihood is defined over discrete word counts, not real-valued embeddings, so it cannot consume TF-IDF vectors directly. This is one practical reason BoW count vectors are still the standard input format for topic models.

### Information retrieval

Full-text search has been the canonical BoW application since the SMART system. Modern search engines such as Apache [Lucene](/wiki/lucene) (and the Elasticsearch, OpenSearch, and Solr products built on top of it) maintain inverted indexes that map each vocabulary term to the list of documents containing it. At query time the engine retrieves the posting lists for query terms, scores each candidate document with BM25 or a TF-IDF variant, and returns the top-k by score. A 2019 study by Yang, Lu, and Lin showed that BM25 over standard TREC collections is still competitive with neural rerankers for many query types, especially when the queries are short and keyword-driven.[13]
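
The inverted index itself is a simple data structure. The following toy sketch keeps everything in memory and omits scoring; the function names are hypothetical and real engines add compression, positional data, and on-disk segments.

```python
from collections import defaultdict

def build_inverted_index(tokenized_docs):
    """Map each term to the sorted list of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(tokenized_docs):
        for token in tokens:
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def candidates(query_tokens, index):
    """Union of posting lists: every document with at least one query term."""
    hits = set()
    for term in query_tokens:
        hits.update(index.get(term, []))
    return sorted(hits)

docs = [
    "i love machine learning".split(),
    "i love deep learning".split(),
    "deep learning is fascinating".split(),
]
index = build_inverted_index(docs)
print(index["deep"])                                  # [1, 2]
print(candidates(["machine", "fascinating"], index))  # [0, 2]
```

Only the candidate documents need to be scored, which is what keeps query latency low even when the collection is large.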

## Bag of visual words in computer vision

The bag of words concept has been adapted for [computer vision](/wiki/computer_vision) under the name "bag of visual words" (BoVW). Instead of counting text words, BoVW counts visual features extracted from images. The process follows three steps:

1. **Feature extraction:** Local features are detected in images using algorithms like Scale-Invariant Feature Transform (SIFT), which converts each image patch into a 128-dimensional descriptor vector that is invariant to scale, rotation, and illumination changes.
2. **Codebook construction:** The extracted feature descriptors from many images are clustered using [k-means](/wiki/k-means) clustering. Each cluster center becomes a "visual word," and the collection of all cluster centers forms a "codebook" (analogous to a text vocabulary).
3. **Histogram generation:** Each image is represented as a histogram counting how many of its feature descriptors fall into each visual word cluster.
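
The histogram step can be sketched as follows, assuming the codebook has already been learned. The 2-D descriptors and three-word codebook are made up for illustration; real SIFT descriptors are 128-dimensional and the codebook would come from k-means over many images.

```python
def nearest_word(descriptor, codebook):
    """Index of the codebook entry closest to the descriptor (squared Euclidean distance)."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(descriptor, center))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))

def bovw_histogram(descriptors, codebook):
    """Count how many descriptors are assigned to each visual word."""
    histogram = [0] * len(codebook)
    for descriptor in descriptors:
        histogram[nearest_word(descriptor, codebook)] += 1
    return histogram

# Hypothetical toy data: three visual words and four local descriptors.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
image_descriptors = [(0.1, 0.1), (0.9, 0.2), (0.8, -0.1), (0.2, 0.9)]
print(bovw_histogram(image_descriptors, codebook))  # [1, 2, 1]
```

The resulting histogram plays the same role as a count vector in text BoW and can be weighted or normalized in the same ways.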

The seminal paper introducing BoVW for image retrieval was Sivic and Zisserman's 2003 "Video Google" system, which adapted text retrieval techniques to retrieve object instances in video frames.[14] Csurka et al.'s 2004 paper "Visual Categorization with Bags of Keypoints" applied the same idea to image classification with SIFT keypoints and a k-means codebook.[15] BoVW was one of the most successful methods for image classification and content-based image retrieval before the rise of [convolutional neural networks](/wiki/convolutional_neural_network). Alternatives like Vector of Locally Aggregated Descriptors (VLAD) and Fisher Vectors improved upon BoVW by encoding higher-order statistics about the feature distribution.[15] BoVW lingers in classical computer-vision pipelines for visual SLAM (where loop-closure detection often runs on bags of binary descriptors) and large-scale image retrieval where memory and inference cost dominate.

## What are the limitations of bag of words?

The bag of words model has several well-known shortcomings that limit its effectiveness for more complex language understanding tasks.

### Loss of word order

Because BoW treats text as an unordered collection of words, it cannot distinguish between sentences with different meanings that use the same words. "The dog bit the man" and "The man bit the dog" produce identical BoW vectors. Negation is another problem: "good" and "not good" are near-opposites, but a BoW model sees them as similar since they share the word "good."[16] Bigrams partially address this by introducing tokens like "not_good," but the fundamental issue persists for any dependency that spans more than two tokens.
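
A quick check of the dog-and-man example, using `collections.Counter` as a stand-in for a BoW vector. The helper names are illustrative.

```python
from collections import Counter

def bow(text):
    """Unigram bag of words as a token -> count mapping."""
    return Counter(text.lower().split())

def bigram_bow(text):
    """Bag of adjacent token pairs."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

a, b = "The dog bit the man", "The man bit the dog"
print(bow(a) == bow(b))                # True
print(bigram_bow(a) == bigram_bow(b))  # False
```

The bigram bags differ only in the pairs ("dog", "bit") and ("man", "bit"), which is exactly the local ordering information that unigrams discard.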

### High dimensionality

The vocabulary size directly determines the vector length. For a realistic corpus, vocabularies can easily reach 50,000 to 100,000 words or more. Each document becomes a vector of that length, which can slow down model training and increase memory usage. This problem, known as the "curse of dimensionality," can also degrade classifier performance when the number of features far exceeds the number of training samples.

### Sparsity

Since any individual document uses only a small fraction of the total vocabulary, the resulting vectors are extremely sparse (mostly zeros). While sparse matrix formats mitigate the storage problem, the sparsity itself can reduce the effectiveness of distance-based algorithms that rely on dense representations. Methods like [latent semantic analysis](/wiki/latent_semantic_analysis) (LSA), which applies truncated singular value decomposition to the BoW matrix, were originally designed to address this sparsity by projecting documents into a denser, lower-dimensional latent space.

### No semantic understanding

BoW treats every word as an independent, orthogonal dimension with no notion of similarity. The words "happy" and "joyful" are as different as "happy" and "bicycle" in a BoW representation. The model cannot capture synonymy, polysemy, or any other semantic relationship between words.

### Out-of-vocabulary handling

At inference time, any token that did not appear in the training vocabulary is silently dropped. For long-tail or evolving domains (product names, hashtags, slang), the BoW pipeline therefore degrades over time without an obvious failure signal. Hashing vectorizers sidestep this problem by mapping any token (seen or unseen) to a bucket, but at the cost of collisions and zero interpretability.
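
One way to turn this silent failure into a visible signal is to track which tokens are dropped. The sketch below is a minimal illustration with a hypothetical three-word vocabulary and helper name; it is not a feature of any particular library.

```python
vocabulary = {"love": 0, "machine": 1, "learning": 2}

def vectorize(tokens, vocabulary):
    """Return the count vector plus the list of tokens that were dropped."""
    vector = [0] * len(vocabulary)
    dropped = []
    for token in tokens:
        if token in vocabulary:
            vector[vocabulary[token]] += 1
        else:
            dropped.append(token)
    return vector, dropped

tokens = "i love transformers and learning".split()
vector, dropped = vectorize(tokens, vocabulary)
print(vector)                      # [1, 0, 1]
print(dropped)                     # ['i', 'transformers', 'and']
print(len(dropped) / len(tokens))  # 0.6
```

Logging the dropped-token rate over time gives a rough drift indicator: a steady rise suggests the vocabulary should be rebuilt on fresher data.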

### Sensitivity to preprocessing

Results can swing several percentage points based on tokenization, stopword list, and stemmer choices. There is no universal best preprocessing pipeline. Practitioners typically grid-search over preprocessing combinations using cross-validation, which is computationally cheap because BoW vectorization is fast.

## How does bag of words differ from word embeddings?

Modern [word embedding](/wiki/word_embedding) methods like [Word2Vec](/wiki/word2vec), [GloVe](/wiki/glove), [fastText](/wiki/fasttext), and contextual embeddings from [transformer](/wiki/transformer) models address many of the limitations of BoW. The table below summarizes the key differences.

| Property | Bag of Words | Word Embeddings |
|---|---|---|
| Vector type | Sparse, high-dimensional | Dense, low-dimensional (50-300 dims typical for static embeddings) |
| Semantic similarity | Not captured | Captured (similar words have similar vectors) |
| Word order | Completely ignored | Partially captured (contextual embeddings fully capture it) |
| Vocabulary dependence | Fixed vocabulary, out-of-vocabulary words are lost | Subword methods handle unseen words |
| Interpretability | High (each dimension corresponds to a known word) | Low (dimensions lack direct interpretation) |
| Computational cost for representation | Low (simple counting) | Higher (requires pre-trained model) |
| Training data requirement | Works with small datasets | Pre-trained models need large corpora |

[Word2Vec](/wiki/word2vec), introduced by Mikolov et al. in 2013, learns dense word vectors from local context windows using either a skip-gram or continuous bag of words objective.[17] Pennington, Socher, and Manning's 2014 [GloVe](/wiki/glove) algorithm trains on global co-occurrence statistics from the corpus.[18] Bojanowski et al.'s 2016 [fastText](/wiki/fasttext) extends the Word2Vec skip-gram model by representing each word as the sum of its character n-gram vectors, which lets it build vectors for unseen words.[19] Devlin et al.'s 2018 [BERT](/wiki/bert) and the broader transformer family produce contextual embeddings: the same word receives different vectors in different sentences, breaking the static-token assumption shared by BoW and Word2Vec alike.[20]

Bag of words remains a strong choice when interpretability matters, training data is limited, or the task is simple enough that word order and semantics are less important. For tasks requiring deeper language understanding, embeddings and [transformer](/wiki/transformer)-based models are generally preferred.[21]

### A useful conceptual bridge

Word2Vec's CBOW (continuous bag of words) variant predicts a target word from the average of its context word vectors. The name reuses "bag of words" because averaging discards the order of the context words. Although Word2Vec is a dense embedding model, one of its two training objectives is built on a literal bag-of-words view of local context. The bag idea outlived its sparse representation.
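The order-invariance that gives CBOW its name is easy to demonstrate. The sketch below averages a handful of made-up 3-dimensional embeddings (the values are hypothetical, not trained) and shows that reordering the context window leaves the result unchanged. It covers only the context-averaging step, not the prediction layer or training loop.

```python
# Hypothetical 3-dimensional embeddings; real models learn these values.
embeddings = {
    "the": [0.5, 0.0, 0.25],
    "cat": [1.0, 0.25, 0.0],
    "on":  [0.0, 0.5, 0.5],
    "mat": [0.75, 0.25, 0.5],
}

def cbow_context_vector(context_words):
    """Average the context word vectors, as in the CBOW input layer."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for word in context_words:
        for i, value in enumerate(embeddings[word]):
            total[i] += value
    return [t / len(context_words) for t in total]

a = cbow_context_vector(["the", "cat", "on", "mat"])
b = cbow_context_vector(["mat", "on", "cat", "the"])
print(a)       # [0.5625, 0.25, 0.3125]
print(a == b)  # True: word order does not affect the average
```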

## Is bag of words still used in 2026?

In 2026, few production systems rely on BoW alone. Yet BoW remains ubiquitous as a baseline and as a feature pipeline. Three patterns recur:

1. **Sanity-check baselines.** When a team fine-tunes a new transformer for text classification, the first comparison is almost always against TF-IDF + logistic regression. If the deep model does not beat that baseline, the likely culprits are noisy labels, an ill-chosen metric, or an underfit model. The TF-IDF baseline is genuinely difficult to beat on tasks like topic classification with abundant labeled data and short documents.[22]
2. **Hybrid retrieval.** Modern dense-retrieval systems (DPR, ColBERT, Contriever) are often paired with a BM25 first-stage retriever. The BM25 stage handles exact matches on rare query terms, while the neural ranker handles vocabulary mismatch and semantic matching. This sparse-dense fusion is the dominant pattern in industrial search and retrieval-augmented generation pipelines.[13]
3. **Lightweight production classifiers.** Spam filters, content moderation triage, and ticket routing systems often use BoW + linear classifiers because they are cheap to retrain and give per-token interpretability that helps with debugging.
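As one possible illustration of the sparse-dense fusion in the second pattern, the sketch below merges two ranked lists with reciprocal rank fusion (RRF). RRF is named here as an assumption: it is one common fusion rule, not something the pattern above prescribes, and many systems instead re-rank BM25 candidates with the neural model. The document IDs and both rankings are invented, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each document scores sum(1 / (k + rank))."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical outputs of a BM25 retriever and a dense retriever.
bm25_ranking = ["d3", "d1", "d7"]
dense_ranking = ["d1", "d9", "d3"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['d1', 'd3', 'd9', 'd7']
```

Documents that both retrievers rank highly rise to the top, while documents found by only one side are kept but ranked lower.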

## Extensions and variants

Several models elaborate on the BoW idea by preserving the bag and changing the weighting or factorization:

- **Latent Semantic Analysis (LSA).** Truncated SVD applied to a TF-IDF matrix.
- **Probabilistic Latent Semantic Analysis (pLSA).** Hofmann's 1999 model gave LSA a probabilistic generative interpretation.
- **Latent Dirichlet Allocation (LDA).** Blei, Ng, and Jordan, 2003. Uses BoW counts as input.[12]
- **Non-negative Matrix Factorization (NMF).** Lee and Seung, 1999. Factors a non-negative count matrix into document-topic and topic-word matrices.
- **Continuous bag of words (CBOW).** The Word2Vec variant where context vectors are averaged.
- **Bag of n-grams.** Generalizes BoW to multi-word tokens.
- **Bag of visual words (BoVW).** Computer-vision analog using SIFT or ORB descriptors.
- **Bag of audio words.** MFCC or other acoustic descriptors clustered into audio words.
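Of the variants listed above, the bag of n-grams is the simplest to sketch. The hypothetical `bag_of_ngrams` helper below counts unigrams and bigrams with whitespace tokenization, which is an assumption made to keep the example short.

```python
from collections import Counter

def bag_of_ngrams(text, n_min=1, n_max=2):
    """Count every contiguous n-gram for n in [n_min, n_max]."""
    tokens = text.lower().split()
    bag = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            bag[" ".join(tokens[i:i + n])] += 1
    return bag

bag = bag_of_ngrams("not good not bad")
print(bag["not"])       # 2
print(bag["not good"])  # 1
print(bag["not bad"])   # 1
```

The bigram features keep "not good" and "not bad" apart, a distinction that unigram counts alone cannot express.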

## Common pitfalls

A short field guide for engineers running BoW in production:

- **Train and test must use the same vocabulary.** Always serialize the fitted vectorizer alongside the trained model.
- **Stopword removal can hurt topic models.** LDA can benefit from keeping stopwords when the corpus is small.
- **Class imbalance.** Naive Bayes classifiers over BoW features are very sensitive to class imbalance because the class prior enters the score directly.
- **Feature explosion with high-order n-grams.** Cap the vocabulary with `min_df` and `max_features`.
- **Length effects.** Either L2-normalize TF-IDF vectors or use BM25 to avoid length bias.
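The length effect in the last item can be seen with two vectors that differ only in scale. The sketch below uses raw counts rather than TF-IDF weights for brevity, and the vectors and the `l2_normalize` helper are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

# Hypothetical count vectors: the long document repeats the short one ten times.
short_doc = [3.0, 4.0, 0.0]
long_doc = [x * 10 for x in short_doc]

print(l2_normalize(short_doc))  # [0.6, 0.8, 0.0]
print(l2_normalize(long_doc))   # [0.6, 0.8, 0.0]
```

Before normalization the long document's dot product with any query is ten times larger; after normalization the two documents are indistinguishable, which is the intended behavior when length carries no signal.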

## Explain like I'm 5 (ELI5)

Imagine you have a big box of toy blocks. Each block has a word written on it. When you read a story, you grab a block for every word in that story and put it in a bag. You do not care about the order the words appeared in. You just count how many times each word showed up. A story about cats might have three "cat" blocks, two "fish" blocks, and one "sleep" block. A story about dogs might have four "dog" blocks and one "park" block. By looking at what is in each bag, a computer can tell the two stories are about different things, even though it never read them like you would. The bag is also mostly empty: it is sized to hold any of 50,000 words but a typical story only uses 200 of them. That is what computer scientists mean when they say BoW vectors are sparse and high-dimensional.

## See also

[TF-IDF](/wiki/tf_idf), [BM25](/wiki/bm25), [Word2Vec](/wiki/word2vec), [GloVe](/wiki/glove), [fastText](/wiki/fasttext), [BERT](/wiki/bert), [Latent Dirichlet Allocation](/wiki/latent_dirichlet_allocation), [latent semantic analysis](/wiki/latent_semantic_analysis), [naive Bayes](/wiki/naive_bayes), [scikit-learn](/wiki/scikit_learn), [n-gram](/wiki/n-gram), [stemming](/wiki/stemming), [lemmatization](/wiki/lemmatization), [information retrieval](/wiki/information_retrieval).

## References

1. Harris, Zellig S. (1954). "Distributional Structure." *Word*, 10(2-3), 146-162.
2. Salton, Gerard; Wong, A.; Yang, C. S. (1975). "A Vector Space Model for Automatic Indexing." *Communications of the ACM*, 18(11), 613-620. https://dl.acm.org/doi/10.1145/361219.361220
3. Salton, Gerard; Buckley, Christopher (1988). "Term-weighting approaches in automatic text retrieval." *Information Processing & Management*, 24(5), 513-523.
4. Spärck Jones, Karen (1972). "A statistical interpretation of term specificity and its application in retrieval." *Journal of Documentation*, 28(1), 11-21.
5. Robertson, Stephen E.; Walker, Steve; Jones, Susan; Hancock-Beaulieu, Micheline; Gatford, Mike (1995). "Okapi at TREC-3." *Proceedings of the 3rd Text REtrieval Conference (TREC-3)*, 109-126.
6. Sahami, Mehran; Dumais, Susan; Heckerman, David; Horvitz, Eric (1998). "A Bayesian Approach to Filtering Junk E-Mail." *AAAI Workshop on Learning for Text Categorization*.
7. Kanaris, Ioannis et al. (2007). "Words versus Character N-Grams for Anti-Spam Filtering." *International Journal on Artificial Intelligence Tools*, 16(06), 1047-1067.
8. Pedregosa, Fabian et al. (2011). "Scikit-learn: Machine Learning in Python." *Journal of Machine Learning Research*, 12, 2825-2830.
9. scikit-learn documentation. "Feature extraction: Text feature extraction." https://scikit-learn.org/stable/modules/feature_extraction.html
10. Maas, Andrew L.; Daly, Raymond E.; Pham, Peter T.; Huang, Dan; Ng, Andrew Y.; Potts, Christopher (2011). "Learning Word Vectors for Sentiment Analysis." *Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies*, 142-150. https://ai.stanford.edu/~amaas/data/sentiment/
11. Graham, Paul (2002). "A Plan for Spam." http://www.paulgraham.com/spam.html
12. Blei, David M.; Ng, Andrew Y.; Jordan, Michael I. (2003). "Latent Dirichlet Allocation." *Journal of Machine Learning Research*, 3, 993-1022.
13. Yang, Wei; Lu, Kuang; Lin, Jimmy (2019). "Critically Examining the 'Neural Hype': Weak Baselines and the Additivity of Effectiveness Gains from Neural Ranking Models." *Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval*, 1129-1132.
14. Sivic, Josef; Zisserman, Andrew (2003). "Video Google: A Text Retrieval Approach to Object Matching in Videos." *Proceedings of the IEEE International Conference on Computer Vision*, 1470-1477.
15. Csurka, Gabriella; Dance, Christopher R.; Fan, Lixin; Willamowski, Jutta; Bray, Cedric (2004). "Visual Categorization with Bags of Keypoints." *Workshop on Statistical Learning in Computer Vision (ECCV)*, 1-22.
16. Manning, Christopher D.; Raghavan, Prabhakar; Schütze, Hinrich (2008). *Introduction to Information Retrieval*. Cambridge University Press.
17. Mikolov, Tomas; Chen, Kai; Corrado, Greg; Dean, Jeffrey (2013). "Efficient Estimation of Word Representations in Vector Space." *Proceedings of ICLR Workshop*.
18. Pennington, Jeffrey; Socher, Richard; Manning, Christopher D. (2014). "GloVe: Global Vectors for Word Representation." *Proceedings of EMNLP*, 1532-1543.
19. Bojanowski, Piotr; Grave, Edouard; Joulin, Armand; Mikolov, Tomas (2017). "Enriching Word Vectors with Subword Information." *Transactions of the Association for Computational Linguistics*, 5, 135-146.
20. Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." *Proceedings of NAACL-HLT*, 4171-4186.
21. Jurafsky, Daniel; Martin, James H. (2024). *Speech and Language Processing*, 3rd edition draft.
22. Wang, Sida; Manning, Christopher D. (2012). "Baselines and Bigrams: Simple, Good Sentiment and Topic Classification." *Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics*, 90-94.