The bag of words (BoW) model is one of the simplest and most widely used methods for representing text as numerical data in machine learning and natural language processing. It converts a document into a fixed-length vector by counting the occurrences of each word from a predefined vocabulary, discarding grammar and word order in the process. Despite its simplicity, the bag of words model has served as a foundational text representation technique for decades and remains relevant for many practical applications.
The conceptual roots of the bag of words model trace back to linguist Zellig Harris's 1954 article "Distributional Structure," which explored the idea that the distribution of words in context carries meaningful information about language. Harris notably observed that "language is not merely a bag of words but a tool with particular properties which have been fashioned in the course of its use," an ironic early usage of the phrase that would later lend its name to the model.[1] Harris's broader distributional hypothesis, which states that words occurring in similar contexts tend to have similar meanings, laid the theoretical groundwork for many statistical approaches to language, including BoW.
The bag of words representation gained practical traction in information retrieval with the vector space models of the 1960s and 1970s, and later became the default baseline for text classification and spam filtering through the 1990s and 2000s. Its straightforward implementation and reasonable performance on many tasks made it a standard starting point for text analysis long before the rise of deep learning and word embedding methods.
The bag of words pipeline involves three main stages: tokenization, vocabulary construction, and vectorization. Each stage transforms raw text into a progressively more structured numerical form.
The first step is breaking raw text into individual units called tokens. In the simplest case, tokens are individual words separated by whitespace. More sophisticated tokenizers handle punctuation, contractions, and special characters. Common preprocessing steps applied during or after tokenization include:
- Lowercasing all text so that "Cat" and "cat" map to the same token
- Removing punctuation and special characters
- Removing stopwords (very common words such as "the," "is," and "on" that carry little discriminative information)
- Stemming or lemmatization, which reduce inflected forms to a common base (e.g., "running" to "run")
After tokenization, the model builds a vocabulary: a list of all unique tokens found across the entire document collection (corpus). Each unique token is assigned an index position. For example, given two sentences:
- "The cat sat on the mat"
- "The dog sat on the log"
After lowercasing and removing the stopwords "the" and "on," the vocabulary might be: [cat, sat, mat, dog, log].
Each document is then represented as a numerical vector whose length equals the size of the vocabulary. The value at each position depends on the weighting scheme used.
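To make these three stages concrete, here is a minimal pure-Python sketch of count vectorization for the two example sentences above. The tokenizer, stopword set, and variable names are illustrative, not a fixed implementation; scikit-learn's vectorizers, shown later, handle these details more robustly.

```python
from collections import Counter

documents = ["The cat sat on the mat", "The dog sat on the log"]
stopwords = {"the", "on"}  # illustrative stopword list

# 1. Tokenization: lowercase, split on whitespace, drop stopwords
tokenized = [
    [word for word in doc.lower().split() if word not in stopwords]
    for doc in documents
]

# 2. Vocabulary construction: every unique token gets an index (sorted alphabetically here)
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# 3. Vectorization: count how often each vocabulary word occurs in each document
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['cat', 'dog', 'log', 'mat', 'sat']
print(vectors)     # [[1, 0, 0, 1, 1], [0, 1, 1, 0, 1]]
```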
Several approaches exist for assigning values to the vector positions. The choice of scheme can significantly affect downstream model performance.
| Scheme | Description | Value at Position i | Best For |
|---|---|---|---|
| Count (frequency) | Counts how many times each word appears in the document | Number of occurrences of word i | General text classification |
| Binary | Records only whether a word is present or absent | 1 if word i is present, 0 otherwise | Short documents, presence-based tasks |
| TF-IDF | Adjusts word counts by how common the word is across all documents | TF(i) x IDF(i) | Information retrieval, distinguishing important terms |
| Normalized frequency | Divides raw counts by the total number of words in the document | Count of word i / total words | Comparing documents of different lengths |
The most basic form of BoW uses raw word counts. If the word "learning" appears three times in a document, the corresponding vector position has a value of 3. This approach is intuitive but can give disproportionate weight to frequent but uninformative words.
Binary BoW simply marks whether a word appears (1) or does not appear (0) in a document, ignoring frequency entirely. This works well for short texts where word repetition is rare, such as tweets or product reviews.
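For instance (a tiny sketch on a hypothetical count vector), collapsing counts to presence or absence is a one-line transformation; scikit-learn's CountVectorizer offers the same behavior through its binary=True option.

```python
# Hypothetical count vector: "good" appears twice, "movie" once, other words absent
count_vector = [2, 1, 0, 0]

# Binary BoW keeps only presence (1) or absence (0)
binary_vector = [1 if count > 0 else 0 for count in count_vector]
print(binary_vector)  # [1, 1, 0, 0]
```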
Term Frequency-Inverse Document Frequency (TF-IDF) is the most popular extension of the basic BoW model. It addresses a key weakness of raw counts: common words like "the" or "is" appear frequently everywhere and do not help distinguish documents from one another. TF-IDF downweights these common terms and upweights rare, distinctive ones.
The formula has two components:
TF-IDF(t, d) = TF(t, d) x log(N / DF(t))
Where t is the term, d is the document, N is the total number of documents, and DF(t) is the number of documents containing term t.
A word that appears frequently in one document but rarely across the corpus receives a high TF-IDF score, signaling that it is particularly relevant to that document.[2]
Consider three short documents:
- Doc 1: "I love machine learning"
- Doc 2: "I love deep learning"
- Doc 3: "deep learning is fascinating"
After lowercasing and removing stopwords ("I," "is"), the vocabulary is: [love, machine, learning, deep, fascinating].
The count vectors are:
| Document | love | machine | learning | deep | fascinating |
|---|---|---|---|---|---|
| Doc 1 | 1 | 1 | 1 | 0 | 0 |
| Doc 2 | 1 | 0 | 1 | 1 | 0 |
| Doc 3 | 0 | 0 | 1 | 1 | 1 |
Notice that "learning" appears in all three documents, so it would receive a low IDF score under TF-IDF weighting. Meanwhile, "machine" and "fascinating" each appear in only one document, so they would receive high IDF scores and help distinguish those documents.
A major weakness of the standard BoW model is that it treats each word independently, losing all information about word order. The n-gram extension addresses this by considering sequences of consecutive words as single tokens rather than individual words alone.
Using bigrams allows the model to partially capture word order. For instance, the phrases "not good" and "very good" become distinct features rather than being collapsed into the same set of individual words. Research on email spam classification has shown that trigram and 4-gram features can achieve classification accuracy above 98%.[3] In practice, combining unigrams with bigrams tends to offer the best balance between capturing local word order and keeping the vocabulary manageable.
The scikit-learn library provides two main classes for bag of words: CountVectorizer for raw counts and TfidfVectorizer for TF-IDF weighted vectors. Both follow the standard fit-transform pattern.
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
"I love machine learning",
"I love deep learning",
"deep learning is fascinating"
]
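# Learn the vocabulary and build the document-term count matrix,
# dropping English stopwords ("I", "is") along the way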
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
# ['deep', 'fascinating', 'learning', 'love', 'machine']
print(X.toarray())
# [[0, 0, 1, 1, 1],
# [1, 0, 1, 1, 0],
# [1, 1, 1, 0, 0]]
from sklearn.feature_extraction.text import TfidfVectorizer
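# Same corpus, same fit-transform interface; the entries become
# TF-IDF weights (rows are L2-normalized by default)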
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
print(X.toarray())
# Each value is now a TF-IDF weighted float
# instead of a raw integer count
Both classes support n-gram ranges through the ngram_range parameter. Setting ngram_range=(1, 2) includes both unigrams and bigrams.[4]
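For example (a small sketch with an illustrative two-review corpus), setting ngram_range=(1, 2) makes phrases such as "not good" show up as features of their own:

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["not good at all", "very good indeed"]  # illustrative corpus

# ngram_range=(1, 2) extracts both unigrams and bigrams
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(reviews)

print(vectorizer.get_feature_names_out())
# ['all', 'at', 'at all', 'good', 'good at', 'good indeed',
#  'indeed', 'not', 'not good', 'very', 'very good']
```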
Scikit-learn internally stores BoW matrices as sparse matrices (using scipy.sparse.csr_matrix), which is essential for handling the high-dimensional, mostly-zero vectors that BoW produces.[5]
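Continuing from the examples above, the matrix X returned by either vectorizer can be inspected directly (both share the same sparsity pattern):

```python
# X is stored as a SciPy CSR sparse matrix rather than a dense array
print(type(X))   # a scipy.sparse CSR matrix (exact class name varies by version)
print(X.shape)   # (3, 5): 3 documents, 5 vocabulary words
print(X.nnz)     # 9 stored non-zero entries out of 15 cells
```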
Despite its simplicity, the bag of words model has proven effective across a wide range of tasks.
| Application | How BoW Is Used | Typical Classifiers |
|---|---|---|
| Sentiment analysis | Documents are vectorized and classified as positive, negative, or neutral based on word frequencies | Naive Bayes, logistic regression, SVM |
| Spam detection | Emails are converted to BoW vectors; spam-indicative words receive high weights | Naive Bayes, random forest |
| Topic modeling | BoW matrices serve as input to algorithms that discover latent topics | Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF) |
| Document classification | BoW vectors enable assigning documents to predefined categories | SVM, logistic regression, decision tree |
| Information retrieval | TF-IDF weighted BoW vectors are used to rank documents by relevance to a search query | Cosine similarity ranking |
| Authorship attribution | Stylometric features based on word frequencies help identify the author of a text | SVM, neural network |
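As an illustration of the information retrieval use case (a minimal sketch with a made-up corpus and query), documents can be ranked by the cosine similarity between their TF-IDF vectors and the query's vector:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "machine learning for text classification",
    "deep learning with neural networks",
    "classical statistics and probability",
]  # illustrative document collection

vectorizer = TfidfVectorizer(stop_words='english')
doc_vectors = vectorizer.fit_transform(documents)

# Project the query into the same vocabulary space, then rank by cosine similarity
query_vector = vectorizer.transform(["text classification with machine learning"])
scores = cosine_similarity(query_vector, doc_vectors).ravel()

ranking = scores.argsort()[::-1]
print(ranking)  # document indices from most to least relevant, e.g. [0, 1, 2]
```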
The bag of words concept has been adapted for computer vision under the name "bag of visual words" (BoVW). Instead of counting text words, BoVW counts visual features extracted from images. The process follows three steps:
1. Extract local feature descriptors (such as SIFT or SURF) from each image.
2. Cluster the descriptors, typically with k-means, to form a "visual vocabulary" of codewords.
3. Represent each image as a histogram that counts how many of its descriptors fall into each codeword cluster.
BoVW was one of the most successful methods for image classification and content-based image retrieval before the rise of convolutional neural networks. More recent alternatives like Vector of Locally Aggregated Descriptors (VLAD) and Fisher Vectors have improved upon BoVW by encoding higher-order statistics about the feature distribution.[6]
The bag of words model has several well-known shortcomings that limit its effectiveness for more complex language understanding tasks.
Because BoW treats text as an unordered collection of words, it cannot distinguish between sentences with different meanings that use the same words. "The dog bit the man" and "The man bit the dog" produce identical BoW vectors. Negation is another problem: "good" and "not good" are near-opposites, but a BoW model sees them as similar since they share the word "good."[7]
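A quick check (a minimal sketch) confirms that the two sentences collapse to the same vector:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["The dog bit the man", "The man bit the dog"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences).toarray()

print(vectorizer.get_feature_names_out())  # ['bit', 'dog', 'man', 'the']
print((X[0] == X[1]).all())                # True: identical BoW vectors
```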
The vocabulary size directly determines the vector length. For a realistic corpus, vocabularies can easily reach 50,000 to 100,000 words or more. Each document becomes a vector of that length, which can slow down model training and increase memory usage. This problem, known as the "curse of dimensionality," can also degrade classifier performance when the number of features far exceeds the number of training samples.
Since any individual document uses only a small fraction of the total vocabulary, the resulting vectors are extremely sparse (mostly zeros). While sparse matrix formats mitigate the storage problem, the sparsity itself can reduce the effectiveness of distance-based algorithms that rely on dense representations.
BoW treats every word as an independent, orthogonal dimension with no notion of similarity. The words "happy" and "joyful" are as different as "happy" and "bicycle" in a BoW representation. The model cannot capture synonymy, polysemy, or any other semantic relationship between words.
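Concretely (a sketch with hypothetical near-paraphrases), two documents that share no vocabulary words are orthogonal under BoW, no matter how close their meanings are:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the movie was happy", "the film was joyful"]  # near-paraphrases

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

# "movie"/"film" and "happy"/"joyful" occupy unrelated dimensions,
# so the two vectors are orthogonal
print(cosine_similarity(X[0], X[1]))  # [[0.]]
```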
Modern word embedding methods like Word2Vec, GloVe, and contextual embeddings from transformer models address many of the limitations of BoW. The table below summarizes the key differences.
| Property | Bag of Words | Word Embeddings |
|---|---|---|
| Vector type | Sparse, high-dimensional | Dense, low-dimensional (50-300 dims typical) |
| Semantic similarity | Not captured | Captured (similar words have similar vectors) |
| Word order | Completely ignored | Partially captured (contextual embeddings fully capture it) |
| Vocabulary dependence | Fixed vocabulary, out-of-vocabulary words are lost | Subword methods handle unseen words |
| Interpretability | High (each dimension corresponds to a known word) | Low (dimensions lack direct interpretation) |
| Computational cost for representation | Low (simple counting) | Higher (requires pre-trained model) |
| Training data requirement | Works with small datasets | Pre-trained models need large corpora |
Bag of words remains a strong choice when interpretability matters, training data is limited, or the task is simple enough that word order and semantics are less important. For tasks requiring deeper language understanding, embedding-based methods that feed dense vectors into neural networks are generally preferred.[8]
Imagine you have a big box of toy blocks. Each block has a word written on it. When you read a story, you grab a block for every word in that story and put it in a bag. You do not care about the order the words appeared in. You just count how many times each word showed up. A story about cats might have three "cat" blocks, two "fish" blocks, and one "sleep" block. A story about dogs might have four "dog" blocks and one "park" block. By looking at what is in each bag, a computer can tell the two stories are about different things, even though it never actually read them like you would.