The bag of words (BoW) model is one of the simplest and most widely used methods for representing text as numerical data in machine learning and natural language processing. It converts a document into a fixed-length vector by counting the occurrences of each word from a predefined vocabulary, discarding grammar and word order in the process. Despite its simplicity, the bag of words model has served as a foundational text representation technique for decades and remains relevant for many practical applications.
The conceptual roots of the bag of words model trace back to linguist Zellig Harris's 1954 article "Distributional Structure," which explored the idea that the distribution of words in context carries meaningful information about language. Harris notably observed that "language is not merely a bag of words but a tool with particular properties which have been fashioned in the course of its use," an ironic early usage of the phrase that would later lend its name to the model.[1] Harris's broader distributional hypothesis, which states that words occurring in similar contexts tend to have similar meanings, laid the theoretical groundwork for many statistical approaches to language, including BoW.
The bag of words representation gained practical traction in information retrieval with the vector space models of the 1960s and 1970s, and later became the default baseline for text classification and spam filtering through the 1990s and 2000s. Its straightforward implementation and reasonable performance on many tasks made it a standard starting point for text analysis long before the rise of deep learning and word embedding methods.
The bag of words pipeline involves three main stages: tokenization, vocabulary construction, and vectorization. Each stage transforms raw text into a progressively more structured numerical form.
The first step is breaking raw text into individual units called tokens. In the simplest case, tokens are individual words separated by whitespace. More sophisticated tokenizers handle punctuation, contractions, and special characters. Common preprocessing steps applied during or after tokenization include:
- Lowercasing all text so that "Cat" and "cat" map to the same token
- Removing punctuation and special characters
- Removing stopwords (very common words such as "the," "is," and "on" that carry little discriminative information)
- Stemming or lemmatization, which reduce inflected forms to a common base (e.g., "running" to "run")
After tokenization, the model builds a vocabulary: a list of all unique tokens found across the entire document collection (corpus). Each unique token is assigned an index position. For example, given two sentences:
- "The cat sat on the mat"
- "The dog sat on the log"
After lowercasing and removing the stopwords "the" and "on," the vocabulary might be: [cat, sat, mat, dog, log].
Each document is then represented as a numerical vector whose length equals the size of the vocabulary. The value at each position depends on the weighting scheme used.
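To make these three stages concrete, here is a minimal pure-Python sketch of count vectorization for the two example sentences above. The tokenizer, stopword set, and variable names are illustrative, not a fixed implementation; scikit-learn's vectorizers, shown later, handle these details more robustly.

```python
from collections import Counter

documents = ["The cat sat on the mat", "The dog sat on the log"]
stopwords = {"the", "on"}  # illustrative stopword list

# 1. Tokenization: lowercase, split on whitespace, drop stopwords
tokenized = [
    [word for word in doc.lower().split() if word not in stopwords]
    for doc in documents
]

# 2. Vocabulary construction: every unique token gets an index (sorted alphabetically here)
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# 3. Vectorization: count how often each vocabulary word occurs in each document
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['cat', 'dog', 'log', 'mat', 'sat']
print(vectors)     # [[1, 0, 0, 1, 1], [0, 1, 1, 0, 1]]
```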
Several approaches exist for assigning values to the vector positions. The choice of scheme can significantly affect downstream model performance.
| Scheme | Description | Value at Position i | Best For |
|---|---|---|---|
| Count (frequency) | Counts how many times each word appears in the document | Number of occurrences of word i | General text classification |
| Binary | Records only whether a word is present or absent | 1 if word i is present, 0 otherwise | Short documents, presence-based tasks |
| TF-IDF | Adjusts word counts by how common the word is across all documents | TF(i) x IDF(i) | Information retrieval, distinguishing important terms |
| Normalized frequency | Divides raw counts by the total number of words in the document | Count of word i / total words | Comparing documents of different lengths |
The most basic form of BoW uses raw word counts. If the word "learning" appears three times in a document, the corresponding vector position has a value of 3. This approach is intuitive but can give disproportionate weight to frequent but uninformative words.
Binary BoW simply marks whether a word appears (1) or does not appear (0) in a document, ignoring frequency entirely. This works well for short texts where word repetition is rare, such as tweets or product reviews.
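For instance (a tiny sketch on a hypothetical count vector), collapsing counts to presence or absence is a one-line transformation; scikit-learn's CountVectorizer offers the same behavior through its binary=True option.

```python
# Hypothetical count vector: "good" appears twice, "movie" once, other words absent
count_vector = [2, 1, 0, 0]

# Binary BoW keeps only presence (1) or absence (0)
binary_vector = [1 if count > 0 else 0 for count in count_vector]
print(binary_vector)  # [1, 1, 0, 0]
```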
Term Frequency-Inverse Document Frequency (TF-IDF) is the most popular extension of the basic BoW model. It addresses a key weakness of raw counts: common words like "the" or "is" appear frequently everywhere and do not help distinguish documents from one another. TF-IDF downweights these common terms and upweights rare, distinctive ones.
The formula has two components:
TF-IDF(t, d) = TF(t, d) x log(N / DF(t))
Where t is the term, d is the document, N is the total number of documents, and DF(t) is the number of documents containing term t.
A word that appears frequently in one document but rarely across the corpus receives a high TF-IDF score, signaling that it is particularly relevant to that document.[2]
Consider three short documents:
- Doc 1: "I love machine learning"
- Doc 2: "I love deep learning"
- Doc 3: "deep learning is fascinating"
After lowercasing and removing stopwords ("I," "is"), the vocabulary is: [love, machine, learning, deep, fascinating].
The count vectors are:
| Document | love | machine | learning | deep | fascinating |
|---|---|---|---|---|---|
| Doc 1 | 1 | 1 | 1 | 0 | 0 |
| Doc 2 | 1 | 0 | 1 | 1 | 0 |
| Doc 3 | 0 | 0 | 1 | 1 | 1 |
Notice that "learning" appears in all three documents, so it would receive a low IDF score under TF-IDF weighting. Meanwhile, "machine" and "fascinating" each appear in only one document, so they would receive high IDF scores and help distinguish those documents.
A major weakness of the standard BoW model is that it treats each word independently, losing all information about word order. The n-gram extension addresses this by considering sequences of consecutive words as single tokens rather than individual words alone.
Using bigrams allows the model to partially capture word order. For instance, the phrases "not good" and "very good" become distinct features rather than being collapsed into the same set of individual words. Research on email spam classification has shown that trigram and 4-gram features can achieve classification accuracy above 98%.[3] In practice, combining unigrams with bigrams tends to offer the best balance between capturing local word order and keeping the vocabulary manageable.
The scikit-learn library provides two main classes for bag of words: CountVectorizer for raw counts and TfidfVectorizer for TF-IDF weighted vectors. Both follow the standard fit-transform pattern.
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
"I love machine learning",
"I love deep learning",
"deep learning is fascinating"
]
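# Learn the vocabulary and build the document-term count matrix,
# dropping English stopwords ("I", "is") along the way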
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
# ['deep', 'fascinating', 'learning', 'love', 'machine']
print(X.toarray())
# [[0, 0, 1, 1, 1],
# [1, 0, 1, 1, 0],
# [1, 1, 1, 0, 0]]
from sklearn.feature_extraction.text import TfidfVectorizer
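# Same corpus, same fit-transform interface; the entries become
# TF-IDF weights (rows are L2-normalized by default)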
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
print(X.toarray())
# Each value is now a TF-IDF weighted float
# instead of a raw integer count
Both classes support n-gram ranges through the ngram_range parameter. Setting ngram_range=(1, 2) includes both unigrams and bigrams.[4]
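For example (a small sketch with an illustrative two-review corpus), setting ngram_range=(1, 2) makes phrases such as "not good" show up as features of their own:

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["not good at all", "very good indeed"]  # illustrative corpus

# ngram_range=(1, 2) extracts both unigrams and bigrams
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(reviews)

print(vectorizer.get_feature_names_out())
# ['all', 'at', 'at all', 'good', 'good at', 'good indeed',
#  'indeed', 'not', 'not good', 'very', 'very good']
```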
Scikit-learn internally stores BoW matrices as sparse matrices (using scipy.sparse.csr_matrix), which is essential for handling the high-dimensional, mostly-zero vectors that BoW produces.[5]
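Continuing from the examples above, the matrix X returned by either vectorizer can be inspected directly (both share the same sparsity pattern):

```python
# X is stored as a SciPy CSR sparse matrix rather than a dense array
print(type(X))   # a scipy.sparse CSR matrix (exact class name varies by version)
print(X.shape)   # (3, 5): 3 documents, 5 vocabulary words
print(X.nnz)     # 9 stored non-zero entries out of 15 cells
```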
Despite its simplicity, the bag of words model has proven effective across a wide range of tasks.
| Application | How BoW Is Used | Typical Classifiers |
|---|---|---|
| Sentiment analysis | Documents are vectorized and classified as positive, negative, or neutral based on word frequencies | Naive Bayes, logistic regression, SVM |
| Spam detection | Emails are converted to BoW vectors; spam-indicative words receive high weights | Naive Bayes, random forest |
| Topic modeling | BoW matrices serve as input to algorithms that discover latent topics | Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF) |
| Document classification | BoW vectors enable assigning documents to predefined categories | SVM, logistic regression, decision tree |
| Information retrieval | TF-IDF weighted BoW vectors are used to rank documents by relevance to a search query | Cosine similarity ranking |
| Authorship attribution | Stylometric features based on word frequencies help identify the author of a text | SVM, neural network |
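As an illustration of the information retrieval use case (a minimal sketch with a made-up corpus and query), documents can be ranked by the cosine similarity between their TF-IDF vectors and the query's vector:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "machine learning for text classification",
    "deep learning with neural networks",
    "classical statistics and probability",
]  # illustrative document collection

vectorizer = TfidfVectorizer(stop_words='english')
doc_vectors = vectorizer.fit_transform(documents)

# Project the query into the same vocabulary space, then rank by cosine similarity
query_vector = vectorizer.transform(["text classification with machine learning"])
scores = cosine_similarity(query_vector, doc_vectors).ravel()

ranking = scores.argsort()[::-1]
print(ranking)  # document indices from most to least relevant, e.g. [0, 1, 2]
```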
The bag of words concept has been adapted for computer vision under the name "bag of visual words" (BoVW). Instead of counting text words, BoVW counts visual features extracted from images. The process follows three steps:
1. Extract local feature descriptors (such as SIFT or SURF) from each image.
2. Cluster the descriptors, typically with k-means, to form a "visual vocabulary" of codewords.
3. Represent each image as a histogram that counts how many of its descriptors fall into each codeword cluster.
BoVW was one of the most successful methods for image classification and content-based image retrieval before the rise of convolutional neural networks. More recent alternatives like Vector of Locally Aggregated Descriptors (VLAD) and Fisher Vectors have improved upon BoVW by encoding higher-order statistics about the feature distribution.[6]
The bag of words model has several well-known shortcomings that limit its effectiveness for more complex language understanding tasks.
Because BoW treats text as an unordered collection of words, it cannot distinguish between sentences with different meanings that use the same words. "The dog bit the man" and "The man bit the dog" produce identical BoW vectors. Negation is another problem: "good" and "not good" are near-opposites, but a BoW model sees them as similar since they share the word "good."[7]
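A quick check (a minimal sketch) confirms that the two sentences collapse to the same vector:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["The dog bit the man", "The man bit the dog"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences).toarray()

print(vectorizer.get_feature_names_out())  # ['bit', 'dog', 'man', 'the']
print((X[0] == X[1]).all())                # True: identical BoW vectors
```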
The vocabulary size directly determines the vector length. For a realistic corpus, vocabularies can easily reach 50,000 to 100,000 words or more. Each document becomes a vector of that length, which can slow down model training and increase memory usage. This problem, known as the "curse of dimensionality," can also degrade classifier performance when the number of features far exceeds the number of training samples.
Since any individual document uses only a small fraction of the total vocabulary, the resulting vectors are extremely sparse (mostly zeros). While sparse matrix formats mitigate the storage problem, the sparsity itself can reduce the effectiveness of distance-based algorithms that rely on dense representations.
BoW treats every word as an independent, orthogonal dimension with no notion of similarity. The words "happy" and "joyful" are as different as "happy" and "bicycle" in a BoW representation. The model cannot capture synonymy, polysemy, or any other semantic relationship between words.
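Concretely (a sketch with hypothetical near-paraphrases), two documents that share no vocabulary words are orthogonal under BoW, no matter how close their meanings are:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the movie was happy", "the film was joyful"]  # near-paraphrases

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

# "movie"/"film" and "happy"/"joyful" occupy unrelated dimensions,
# so the two vectors are orthogonal
print(cosine_similarity(X[0], X[1]))  # [[0.]]
```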
Modern word embedding methods like Word2Vec, GloVe, and contextual embeddings from transformer models address many of the limitations of BoW. The table below summarizes the key differences.
| Property | Bag of Words | Word Embeddings |
|---|---|---|
| Vector type | Sparse, high-dimensional | Dense, low-dimensional (50-300 dims typical) |
| Semantic similarity | Not captured | Captured (similar words have similar vectors) |
| Word order | Completely ignored | Partially captured (contextual embeddings fully capture it) |
| Vocabulary dependence | Fixed vocabulary, out-of-vocabulary words are lost | Subword methods handle unseen words |
| Interpretability | High (each dimension corresponds to a known word) | Low (dimensions lack direct interpretation) |
| Computational cost for representation | Low (simple counting) | Higher (requires pre-trained model) |
| Training data requirement | Works with small datasets | Pre-trained models need large corpora |
Bag of words remains a strong choice when interpretability matters, training data is limited, or the task is simple enough that word order and semantics are less important. For tasks requiring deeper language understanding, embedding-based methods that feed dense vectors into neural networks are generally preferred.[8]
Imagine you have a big box of toy blocks. Each block has a word written on it. When you read a story, you grab a block for every word in that story and put it in a bag. You do not care about the order the words appeared in. You just count how many times each word showed up. A story about cats might have three "cat" blocks, two "fish" blocks, and one "sleep" block. A story about dogs might have four "dog" blocks and one "park" block. By looking at what is in each bag, a computer can tell the two stories are about different things, even though it never actually read them like you would.