Bag of Words
Last reviewed
May 9, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v4 · 5,991 words
The bag of words (BoW) model is one of the simplest and most widely used methods for representing text as numerical data in machine learning and natural language processing. It converts a document into a fixed-length vector by counting the occurrences of each word from a predefined vocabulary, discarding grammar and word order in the process. Despite its simplicity, the bag of words model has served as a foundational text representation technique for decades and remains relevant for many practical applications, particularly classical information retrieval, naive Bayes spam filtering, baseline text classification, and feature pipelines that feed topic modeling algorithms.
In its most general statement, BoW commits to a fixed vocabulary V of size |V| and maps every document d to a vector x in R^|V|, where component x_i records how often word i occurs in d. The exact mapping depends on the weighting scheme: raw counts, binary presence flags, normalized counts, TF-IDF weights, or BM25 scores. Whatever syntax, ordering, and discourse structure the original text contained gets thrown out, leaving a single vector that summarizes which words appeared and how often. The surprise of BoW is how often that crude summary turns out to be enough.
The conceptual roots of the bag of words model trace back to linguist Zellig Harris's 1954 article "Distributional Structure," which explored the idea that the distribution of words in context carries meaningful information about language. Harris notably observed that "language is not merely a bag of words but a tool with particular properties which have been fashioned in the course of its use," an ironic early usage of the phrase that would later lend its name to the model.[1] Harris's broader distributional hypothesis, which states that words occurring in similar contexts tend to have similar meanings, laid the theoretical groundwork for many statistical approaches to language, including BoW.
The explicit formalization of BoW as a numerical representation came from work in information retrieval during the 1960s and 1970s. Gerard Salton and his collaborators at Cornell developed the SMART Information Retrieval System, which represented documents and queries as numeric vectors over a vocabulary. Their 1975 paper "A Vector Space Model for Automatic Indexing" set out the geometry that underlies modern retrieval: documents are points in a high-dimensional space, queries are also points, and similarity reduces to the cosine of the angle between two vectors.[2] The vocabulary defines the axes and BoW counts give the coordinates.
A key refinement followed in 1988 when Salton and Christopher Buckley published "Term-weighting approaches in automatic text retrieval," cataloging the family of TF-IDF style weights that would dominate text retrieval for the next quarter century.[3] Karen Sparck Jones had already published her 1972 paper introducing inverse document frequency as a way to scale term importance based on rarity.[4] The Okapi BM25 weighting function, developed by Stephen Robertson and his collaborators at City University in London during the early 1990s, became the standard probabilistic refinement of TF-IDF and is still the default ranker in many open-source search engines today.[5]
The bag of words model gained practical traction outside retrieval in the late 1990s and early 2000s as researchers applied it to text classification and spam filtering. The early Naive Bayes spam filter of Sahami, Dumais, Heckerman, and Horvitz (1998) and the Bayesian component of the largely rule-based filter SpamAssassin both treated email as a bag of words and scored each message by the conditional probability of its tokens under spam vs. ham distributions.[6] Its straightforward implementation and reasonable performance on many tasks made BoW a default baseline for text analysis long before the rise of deep learning and word embedding methods.
The bag of words pipeline involves three main stages: tokenization, vocabulary construction, and vectorization. In production there is usually a fourth stage of vocabulary trimming and a fifth weighting step that converts raw counts to TF-IDF or BM25 scores before the vector reaches a classifier.
The first step is breaking raw text into individual units called tokens. In the simplest case, tokens are individual words separated by whitespace. More sophisticated tokenizers handle punctuation, contractions, and special characters. Common preprocessing steps applied during or after tokenization include lowercasing, punctuation handling, stopword removal, stemming, and lemmatization.
Tokenization choices have a surprisingly large effect on downstream model quality. A whitespace tokenizer that ignores punctuation will treat "don't" as a single token, while a regex tokenizer using \w+ will split it into "don" and "t." What matters is consistency between training and inference. Stemming reduces "organize," "organizing," and "organization" to the same stem (typically "organ" under the Porter stemmer), shrinking the vocabulary at the cost of some semantic precision. Lemmatization is more linguistically faithful: it maps each inflected form to its dictionary form (lemma) rather than chopping suffixes, mapping "better" to "good" and "organized" to "organize." The Porter stemmer, the Snowball stemmer, and the WordNet lemmatizer (via NLTK) cover most English use cases.
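The contrast between the two tokenizers can be seen in a minimal sketch that uses only the Python standard library. The sample sentence is an illustrative assumption, not taken from any dataset.

```python
import re

text = "I don't like spam"  # illustrative sentence

# Whitespace tokenizer: punctuation stays attached to the word.
whitespace_tokens = text.lower().split()

# Regex tokenizer: \w+ keeps runs of letters, digits, and underscores,
# so the apostrophe acts as a separator.
regex_tokens = re.findall(r"\w+", text.lower())

print(whitespace_tokens)  # ['i', "don't", 'like', 'spam']
print(regex_tokens)       # ['i', 'don', 't', 'like', 'spam']
```

Either choice can work, provided the same tokenizer is applied at training and inference time.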
After tokenization, the model builds a vocabulary: a list of all unique tokens found across the entire document collection (corpus). Each unique token is assigned an index position. For example, given two sentences such as "The cat sat on the mat" and "The dog sat on the log":
After lowercasing and removing the stopwords "the" and "on," the vocabulary might be: [cat, sat, mat, dog, log].
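Vocabulary construction can be sketched in a few lines of plain Python. The two sentences and the stopword set below are assumptions chosen to reproduce the five-token vocabulary above; a real pipeline would use a proper tokenizer and stopword list.

```python
# Two example sentences and a tiny stopword list (both assumed for illustration).
sentences = ["The cat sat on the mat", "The dog sat on the log"]
stopwords = {"the", "on"}

vocabulary = {}  # token -> index, in order of first appearance
for sentence in sentences:
    for token in sentence.lower().split():
        if token not in stopwords and token not in vocabulary:
            vocabulary[token] = len(vocabulary)

print(vocabulary)
# {'cat': 0, 'sat': 1, 'mat': 2, 'dog': 3, 'log': 4}
```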
For a real corpus the vocabulary will not be five tokens. The 20 Newsgroups dataset produces a vocabulary of about 100,000 distinct word types after light cleaning. The English Wikipedia dump contains tens of millions of distinct tokens, most of them rare. The trick most pipelines use is vocabulary trimming: drop any token that appears in fewer than min_df documents (often 2 or 5) and any token that appears in more than a max_df fraction of documents (often 0.9 or 0.95). This removes typos, OCR errors, and stopword-like tokens without a custom stopword list.
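A minimal sketch of document-frequency trimming follows. The function name trim_vocabulary and the toy corpus are hypothetical; the parameter names simply mirror the min_df and max_df convention described above.

```python
from collections import Counter

def trim_vocabulary(tokenized_docs, min_df=2, max_df=0.9):
    """Keep tokens found in at least min_df documents and in at most
    a max_df fraction of all documents."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each token once per document
    return sorted(
        tok for tok, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    )

# Toy corpus (hypothetical): "teh" is a one-off typo, "the" is everywhere.
docs = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "dog", "sat"],
    ["teh", "dog", "ran", "the"],
]
print(trim_vocabulary(docs))  # ['cat', 'dog', 'ran', 'sat']
```

The typo falls below min_df and the ubiquitous "the" exceeds max_df, so both disappear without any hand-written stopword list.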
A related technique called the hashing trick sidesteps the vocabulary entirely. The system applies a fast hash function that maps every token to one of a fixed number of buckets (say 2^18 or 2^20). The output vector has that fixed size regardless of corpus, with the trade-off that distinct tokens can collide into the same bucket. Scikit-learn's HashingVectorizer implements this idea and is the standard choice for streaming or out-of-core text classification.
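The hashing trick can be illustrated with the standard library alone. This is a sketch under stated assumptions: hashed_bow is a hypothetical helper, and MD5 is used only because it is stable across runs, whereas scikit-learn's HashingVectorizer uses a different hash function (MurmurHash3).

```python
import hashlib

def hashed_bow(tokens, n_buckets=2**18):
    """Map tokens to a fixed-size count vector stored as {bucket: count}."""
    vec = {}
    for tok in tokens:
        # Python's built-in hash() is salted per process for strings,
        # so a stable digest is used instead.
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % n_buckets
        vec[bucket] = vec.get(bucket, 0) + 1
    return vec

vec = hashed_bow("the cat sat on the mat".split())
print(sum(vec.values()))                 # 6 tokens in total, however they are bucketed
print(all(0 <= b < 2**18 for b in vec))  # True
```

No vocabulary is stored anywhere: the bucket index is recomputed from the token each time, which is what makes the approach suitable for streaming.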
Each document is then represented as a numerical vector whose length equals the size of the vocabulary. The value at each position depends on the weighting scheme used. The collection of all such document vectors stacked into a matrix is called the document-term matrix, sometimes abbreviated DTM. With N documents and a vocabulary of size |V|, the DTM is an N by |V| matrix, almost always stored sparsely because the typical document only touches a few hundred of the tens of thousands of vocabulary terms.
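To make the sparse storage idea concrete, the sketch below keeps only the non-zero columns of each row and expands to a dense matrix purely for display. The vocabulary and tokenized documents are assumed for illustration; production code would normally rely on a sparse matrix library rather than dictionaries.

```python
from collections import Counter

# Vocabulary and tokenized documents are assumed for illustration.
vocabulary = {"cat": 0, "sat": 1, "mat": 2, "dog": 3, "log": 4}
docs = [["cat", "sat", "mat"], ["dog", "sat", "log"]]

# Sparse row format: each document stores only its non-zero columns.
sparse_rows = []
for tokens in docs:
    counts = Counter(vocabulary[t] for t in tokens if t in vocabulary)
    sparse_rows.append(dict(sorted(counts.items())))

print(sparse_rows)  # [{0: 1, 1: 1, 2: 1}, {1: 1, 3: 1, 4: 1}]

# Expanding to a dense N x |V| matrix is only sensible for small examples.
dense = [[row.get(j, 0) for j in range(len(vocabulary))] for row in sparse_rows]
print(dense)  # [[1, 1, 1, 0, 0], [0, 1, 0, 1, 1]]
```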
Several approaches exist for assigning values to the vector positions. The choice of scheme can significantly affect downstream model performance.
| Scheme | Description | Value at Position i | Best For |
|---|---|---|---|
| Count (frequency) | Counts how many times each word appears in the document | Number of occurrences of word i | General text classification |
| Binary | Records only whether a word is present or absent | 1 if word i is present, 0 otherwise | Short documents, presence-based tasks |
| Log-frequency | Log-dampened raw counts | 1 + log(count) if count > 0, else 0 | Reducing the weight of repeated tokens |
| TF-IDF | Adjusts word counts by how common the word is across all documents | TF(i) x IDF(i) | Information retrieval, distinguishing important terms |
| Normalized frequency | Divides raw counts by the total number of words in the document | Count of word i / total words | Comparing documents of different lengths |
| BM25 | Probabilistic ranking score with length normalization | See BM25 formula below | Search ranking, robust to long documents |
The most basic form of BoW uses raw word counts. If the word "learning" appears three times in a document, the corresponding vector position has a value of 3. This approach is intuitive but can give disproportionate weight to frequent but uninformative words.
Binary BoW simply marks whether a word appears (1) or does not appear (0) in a document, ignoring frequency entirely. This works well for short texts where word repetition is rare, such as tweets or product reviews. Binary BoW is also the natural input for multivariate Bernoulli naive Bayes, where each document is modeled as a draw of |V| independent Bernoulli variables that fire only on word presence.
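The simpler schemes in the table differ only in how a raw count is transformed. The sketch below applies four of them to one toy document; the document and the three-word vocabulary are assumptions made for illustration.

```python
import math
from collections import Counter

tokens = "learning to rank is learning by learning".split()  # toy document
vocabulary = ["learning", "rank", "deep"]                     # assumed vocabulary

counts = Counter(tokens)
total = len(tokens)

count_vec = [counts[w] for w in vocabulary]
binary_vec = [1 if counts[w] > 0 else 0 for w in vocabulary]
logfreq_vec = [1 + math.log(counts[w]) if counts[w] > 0 else 0.0 for w in vocabulary]
normalized_vec = [counts[w] / total for w in vocabulary]

print(count_vec)   # [3, 1, 0]
print(binary_vec)  # [1, 1, 0]
print([round(v, 3) for v in logfreq_vec])     # [2.099, 1.0, 0.0]
print([round(v, 3) for v in normalized_vec])  # [0.429, 0.143, 0.0]
```

Log-frequency shrinks the gap between a word seen three times and a word seen once, while binary weighting removes it entirely.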
Term Frequency-Inverse Document Frequency (TF-IDF) is the most popular extension of the basic BoW model. It addresses a key weakness of raw counts: common words like "the" or "is" appear frequently everywhere and do not help distinguish documents from one another. TF-IDF downweights these common terms and upweights rare, distinctive ones.
The formula multiplies two components, the term frequency TF(t, d) and the inverse document frequency log(N / DF(t)):
TF-IDF(t, d) = TF(t, d) x log(N / DF(t))
Where t is the term, d is the document, N is the total number of documents, and DF(t) is the number of documents containing term t.
A word that appears frequently in one document but rarely across the corpus receives a high TF-IDF score, signaling that it is particularly relevant to that document.[3]
There are several common variants of the IDF expression. The classical Sparck Jones IDF is log(N / DF(t)). Scikit-learn uses a smoothed version, log((1 + N) / (1 + DF(t))) + 1, which avoids division by zero for unseen terms. Many implementations apply L2 normalization so that document length cancels out under cosine similarity. Combined with cosine similarity, normalized TF-IDF is essentially the lingua franca of classical IR. See the dedicated TF-IDF article for more detail.
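The two IDF expressions quoted above can be compared directly. The corpus size and document frequencies in this sketch are assumed toy values, and natural logarithms are used throughout.

```python
import math

N = 3  # number of documents in a toy corpus (assumed)
doc_freq = {"learning": 3, "love": 2, "machine": 1}  # DF(t), assumed

def idf_classic(df, n):
    # Classical form: log(N / DF)
    return math.log(n / df)

def idf_smooth(df, n):
    # Smoothed form quoted above: log((1 + N) / (1 + DF)) + 1
    return math.log((1 + n) / (1 + df)) + 1

for term, df in doc_freq.items():
    print(term, round(idf_classic(df, N), 3), round(idf_smooth(df, N), 3))
# learning 0.0 1.0
# love 0.405 1.288
# machine 1.099 1.693
```

Under the classical form, a term that occurs in every document gets an IDF of zero and drops out of the vector; the smoothed form keeps it with weight 1.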
BM25, formally Okapi BM25, is a probabilistic relevance-ranking function published by Stephen Robertson and his collaborators in their 1994 and 1995 TREC papers and refined into its standard form in the late 1990s.[5] Although BM25 is more often described as a ranking function than as a feature weighting, it can be viewed as a non-linear weighting scheme over the same BoW vocabulary as TF-IDF. The standard formula is:
BM25(q, d) = sum over t in q of
IDF(t) * (f(t,d) * (k1 + 1)) / (f(t,d) + k1 * (1 - b + b * |d| / avgdl))
where f(t,d) is the frequency of term t in document d, |d| is the length of d, avgdl is the average document length, and the parameters k1 (typically 1.2 to 2.0) and b (typically 0.75) control term-frequency saturation and length normalization. BM25 saturates: doubling the count of a term does not double its score, preventing a stuffed keyword from dominating. BM25 also normalizes for document length. These two corrections explain why BM25 has been the workhorse of full-text search for nearly thirty years and remains the default similarity in Lucene, Elasticsearch, OpenSearch, and Solr. See the dedicated BM25 article for derivations and tuning.
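A compact sketch of the scoring formula is given below. The function name bm25_scores and the toy documents are hypothetical, and because the formula above leaves IDF(t) unspecified, the code assumes one commonly used non-negative variant, log(1 + (N - DF + 0.5) / (DF + 0.5)).

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in docs against a tokenized query."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    df = Counter()
    for d in docs:
        df.update(set(d))

    def idf(term):
        # Assumed IDF variant; other definitions exist.
        return math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query:
            f = tf[t]
            denom = f + k1 * (1 - b + b * len(d) / avgdl)
            score += idf(t) * f * (k1 + 1) / denom
        scores.append(score)
    return scores

# Three equal-length toy documents containing "deep" once, twice, and never.
docs = [
    "deep learning for search".split(),
    "deep deep for search".split(),
    "cooking pasta at home".split(),
]
scores = bm25_scores(["deep"], docs)
print(round(scores[1] / scores[0], 3))  # 1.375
```

Doubling the term count raises the score by a factor of 1.375 rather than 2 with k1 = 1.2, which is the saturation effect described above.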
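Consider three short documents: "I love machine learning" (Doc 1), "I love deep learning" (Doc 2), and "deep learning is fascinating" (Doc 3).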
After lowercasing and removing stopwords ("I," "is"), the vocabulary is: [love, machine, learning, deep, fascinating].
The count vectors are:
| Document | love | machine | learning | deep | fascinating |
|---|---|---|---|---|---|
| Doc 1 | 1 | 1 | 1 | 0 | 0 |
| Doc 2 | 1 | 0 | 1 | 1 | 0 |
| Doc 3 | 0 | 0 | 1 | 1 | 1 |
Notice that "learning" appears in all three documents, so it would receive a low IDF score under TF-IDF weighting. Meanwhile, "machine" and "fascinating" each appear in only one document, so they would receive high IDF scores and help distinguish those documents.
Under TF-IDF weighting (smoothed scikit-learn variant with L2 normalization), "learning" receives a smaller weight than "machine" or "fascinating" in their respective documents, because IDF tells the model that "learning" is uninformative within this corpus. Cosine similarity between Doc 1 and Doc 2 (which share "love" and "learning") is higher than between Doc 1 and Doc 3, matching human intuition that the first two documents are more similar.
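The comparison can be checked by hand with a small script. It is a sketch that reimplements the smoothed IDF and L2 normalization in plain Python rather than calling scikit-learn, and it takes the stopword-filtered tokens of the three documents as given.

```python
import math
from collections import Counter

docs = {
    "Doc 1": ["love", "machine", "learning"],
    "Doc 2": ["love", "deep", "learning"],
    "Doc 3": ["deep", "learning", "fascinating"],
}
n_docs = len(docs)
df = Counter(t for tokens in docs.values() for t in set(tokens))

def tfidf_unit_vector(tokens):
    tf = Counter(tokens)
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

def cosine(u, v):
    # Vectors are already unit length, so the dot product is the cosine.
    return sum(w * v.get(t, 0.0) for t, w in u.items())

vecs = {name: tfidf_unit_vector(tokens) for name, tokens in docs.items()}
sim_12 = cosine(vecs["Doc 1"], vecs["Doc 2"])
sim_13 = cosine(vecs["Doc 1"], vecs["Doc 3"])
print(round(sim_12, 3), round(sim_13, 3))  # 0.544 0.181
```

Doc 1 and Doc 3 share only "learning," the least informative term, so their similarity is roughly a third of the Doc 1 and Doc 2 value.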
A major weakness of the standard BoW model is that it treats each word independently, losing all information about word order. The n-gram extension addresses this by considering sequences of consecutive words as single tokens rather than individual words alone.
Using bigrams allows the model to partially capture word order. For instance, the phrases "not good" and "very good" become distinct features rather than being collapsed into the same set of individual words. Research on email spam classification has shown that trigram and 4-gram features can achieve classification accuracy above 98%.[7] In practice, combining unigrams with bigrams tends to offer the best balance between capturing local word order and keeping the vocabulary manageable.
| n-gram order | Example tokens for "natural language processing is fun" | Vocabulary size impact | Notes |
|---|---|---|---|
| Unigram (n=1) | natural, language, processing, is, fun | Baseline V | Pure BoW; loses all word order |
| Bigram (n=2) | natural language, language processing, processing is, is fun | Up to V^2 in theory; in practice 5-15x V | Captures common collocations like "not good" |
| Trigram (n=3) | natural language processing, language processing is, processing is fun | Even sparser; up to V^3 | Useful for short genre-specific corpora |
| Char 3-gram | nat, atu, tur, ura, ral, ... | Bounded by alphabet^3 | Robust to typos; used in language ID and spam filtering |
| Char 4-gram | natu, atur, tura, ural, ... | Bounded by alphabet^4 | Within the 3 to 6 character range used by default in fastText subword models |
Character n-grams are a cousin of word n-grams. They produce smaller and bounded vocabularies (limited by the alphabet) and tend to be robust to typographic noise, which is why early spam filters and forensic authorship attribution systems leaned on character 3-grams and 4-grams.[7] fastText reuses the same idea in a learned-embedding setting.
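Both kinds of n-gram come from the same sliding-window operation, as the following sketch shows. The helper names word_ngrams and char_ngrams are hypothetical.

```python
def word_ngrams(tokens, n):
    """Contiguous word sequences of length n, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n):
    """Contiguous character sequences of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

tokens = "natural language processing is fun".split()
print(word_ngrams(tokens, 2))
# ['natural language', 'language processing', 'processing is', 'is fun']
print(char_ngrams("natural", 3))
# ['nat', 'atu', 'tur', 'ura', 'ral']
```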
The scikit-learn library provides two main classes for bag of words: CountVectorizer for raw counts and TfidfVectorizer for TF-IDF weighted vectors. Both follow the standard fit-transform pattern.
```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love machine learning",
    "I love deep learning",
    "deep learning is fascinating"
]

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
# ['deep', 'fascinating', 'learning', 'love', 'machine']

print(X.toarray())
# [[0, 0, 1, 1, 1],
#  [1, 0, 1, 1, 0],
#  [1, 1, 1, 0, 0]]
```
```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)

print(X.toarray())
# Each value is now a TF-IDF weighted float
# instead of a raw integer count
```
Both classes support n-gram ranges through the ngram_range parameter. Setting ngram_range=(1, 2) includes both unigrams and bigrams.[8]
Scikit-learn internally stores BoW matrices as sparse matrices (using scipy.sparse.csr_matrix), which is essential for handling the high-dimensional, mostly-zero vectors that BoW produces.[9]
```python
from sklearn.feature_extraction.text import HashingVectorizer

hasher = HashingVectorizer(n_features=2**18, alternate_sign=False)
X = hasher.transform(corpus)
```
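HashingVectorizer does not maintain a vocabulary, so it can be applied to a stream of new documents without coordinating a global token table. The trade-off is collisions. With 2^18 (262,144) buckets and 100,000 unique tokens, an ideal uniform hash still leaves roughly 30% of tokens sharing a bucket with at least one other token. Most of those tokens are rare, so the effect on a linear classifier is usually small, and raising n_features to 2^20 brings the share below 10%.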
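A back-of-the-envelope estimate helps when choosing the number of buckets. The sketch below assumes an ideal uniform hash and computes the expected share of tokens that land in a bucket with at least one other token; the function name is hypothetical, and the figures say nothing about how much a collision hurts a given model.

```python
def expected_collision_share(n_tokens, n_buckets):
    """Expected share of tokens that share a bucket with another token,
    assuming an ideal uniform hash."""
    return 1 - (1 - 1 / n_buckets) ** (n_tokens - 1)

for exponent in (18, 20, 22):
    share = expected_collision_share(100_000, 2 ** exponent)
    print(f"2^{exponent}: {share:.1%}")
# 2^18: 31.7%
# 2^20: 9.1%
# 2^22: 2.4%
```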
```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import cross_val_score

train = fetch_20newsgroups(subset='train', categories=['sci.med', 'sci.space'])

pipe = Pipeline([
    ('vec', TfidfVectorizer(
        stop_words='english',
        ngram_range=(1, 2),
        min_df=2,
        max_df=0.9,
        sublinear_tf=True,
    )),
    ('clf', LogisticRegression(max_iter=1000, C=1.0)),
])

scores = cross_val_score(pipe, train.data, train.target, cv=5)
print(scores.mean())
```
A TF-IDF plus logistic regression pipeline like this one routinely scores in the high 90s on the binary medicine vs. space split of 20 Newsgroups, which is one of the reasons BoW remains the sanity-check baseline for any new neural text classifier.
The gensim library uses a slightly different idiom in which a Dictionary builds the vocabulary and corpus iterators yield (token_id, count) tuples rather than dense vectors. This streaming-friendly format makes gensim a natural choice for fitting large topic models such as LDA.
```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [doc.lower().split() for doc in corpus]
dictionary = Dictionary(texts)
bow_corpus = [dictionary.doc2bow(text) for text in texts]

lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=2, passes=10)
```
LDA only operates on integer counts, so the bag of words representation is mandatory. This dependence is one of the reasons BoW continues to ship in modern data-science stacks even when the rest of the pipeline uses dense embeddings.
A typical end-to-end BoW pipeline for text classification has the following stages:
| Stage | Tools | Purpose | Comment |
|---|---|---|---|
| Cleanup | regex, BeautifulSoup | Strip HTML, lowercase, remove URLs and emails | Critical for web-scraped text |
| Tokenization | regex, NLTK, spaCy | Split text into tokens | Choice between word, subword, or character tokens |
| Stopword removal | NLTK, sklearn lists | Remove uninformative high-frequency tokens | Sometimes hurts; topic models often want stopwords |
| Normalization | Porter, Snowball, WordNet | Stemming or lemmatization | Stemming faster, lemmatization more precise |
| Vocabulary trimming | min_df / max_df | Drop rare typos and ultra-frequent terms | Often more impactful than stopword removal |
| Vectorization | CountVectorizer, TfidfVectorizer | Build sparse DTM | Use n-gram ranges of (1, 2) as a default |
| Weighting | TF-IDF, BM25 | Down-weight common, up-weight discriminative | Sublinear TF often helps |
| Modeling | naive Bayes, logistic regression, SVM | Fit a classifier or ranker | Linear models scale well to high-dim sparse input |
| Evaluation | accuracy, F1, MAP, NDCG | Score predictions | Pick a metric matched to the task |
The pipeline is deliberately modular: each step has a well-tested implementation in scikit-learn, spaCy, or NLTK. When a transformer fails on a niche corpus, an engineer can usually stand up a TF-IDF baseline in an afternoon, ship it, and revisit the deep-learning option later.
Despite its simplicity, the bag of words model has proven effective across a wide range of tasks.
| Application | How BoW Is Used | Typical Classifiers |
|---|---|---|
| Sentiment analysis | Documents are vectorized and classified as positive, negative, or neutral based on word frequencies | Naive Bayes, logistic regression, SVM |
| Spam detection | Emails are converted to BoW vectors; spam-indicative words receive high weights | Naive Bayes, random forest |
| Topic modeling | BoW matrices serve as input to algorithms that discover latent topics | Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF) |
| Document classification | BoW vectors enable assigning documents to predefined categories | SVM, logistic regression, decision tree |
| Information retrieval | TF-IDF weighted BoW vectors are used to rank documents by relevance to a search query | Cosine similarity ranking, BM25 |
| Authorship attribution | Stylometric features based on word frequencies help identify the author of a text | SVM, neural network |
| Plagiarism detection | Pairwise cosine similarity over BoW vectors flags near-duplicate documents | Threshold cosine, MinHash |
| Document clustering | Sparse TF-IDF vectors are clustered with k-means or spectral methods | k-means, agglomerative clustering |
| News deduplication | Compact BoW signatures identify near-duplicate stories in real-time feeds | LSH over TF-IDF |
| Patent prior-art search | TF-IDF vectors over patent texts retrieve related patents and publications | BM25 |
| Legal e-discovery | TF-IDF and Boolean BoW retrieval surface relevant documents during litigation | Logistic regression on TF-IDF |
The textbook benchmark for BoW sentiment analysis is the IMDb movie review dataset of 50,000 polarity-labeled reviews collected by Maas et al. for a 2011 ACL paper.[10] A unigram TF-IDF classifier with logistic regression typically scores about 88% accuracy on that benchmark, well below modern transformer scores in the mid-90s but remarkable for a bag-of-counts model that fits in a few hundred megabytes of RAM and trains in seconds. For sentiment analysis on very long documents (legal filings, financial reports), BoW can be competitive with, and sometimes outperform, small transformer fine-tunes, because document-level sentiment is largely a frequency phenomenon.
Naive Bayes spam filters trace their lineage to the 1998 paper by Sahami, Dumais, Heckerman, and Horvitz and to Paul Graham's 2002 essay "A Plan for Spam."[11] Both approaches treat email as a bag of words, compute the conditional probability of each token under the spam and ham classes, and combine the per-token probabilities into a single message-level score using Bayes' rule with an independence assumption. SpamAssassin, the de facto open-source filter on UNIX mail servers since the early 2000s, uses a hybrid of hand-written rules and BoW Naive Bayes scoring. BoW Naive Bayes remains a live component of large-scale spam pipelines because it is fast, interpretable, and easy to update online with new tokens.
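The scoring procedure can be sketched as a multinomial Naive Bayes classifier with add-one smoothing. This is an illustrative toy, not the method of any particular filter named above: the four training messages are hypothetical, and real filters train on far larger corpora and add many refinements.

```python
import math
from collections import Counter

# Tiny hypothetical training set.
train = [
    ("win cash prize now", "spam"),
    ("cheap prize win win", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]

word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    class_counts[label] += 1
vocab = set(word_counts["spam"]) | set(word_counts["ham"])

def log_posterior(text, label):
    """log P(label) + sum of log P(token | label), with add-one smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for tok in text.split():
        if tok in vocab:  # tokens never seen in training are ignored
            score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
    return score

def classify(text):
    return max(("spam", "ham"), key=lambda label: log_posterior(text, label))

print(classify("win a cash prize"))      # spam
print(classify("agenda for the lunch"))  # ham
```

Working in log space avoids numerical underflow when many per-token probabilities are multiplied together, and the counts can be updated incrementally as new labeled mail arrives.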
David Blei, Andrew Ng, and Michael Jordan introduced Latent Dirichlet Allocation (LDA) in their 2003 JMLR paper.[12] LDA assumes each document is a mixture of latent topics, and each topic is a probability distribution over words. The input is a BoW count matrix; the output is two factorized matrices, one mapping documents to topic mixtures and the other mapping topics to word distributions. LDA's likelihood is defined over discrete word counts, not real-valued embeddings, so it cannot consume TF-IDF vectors directly. This is one practical reason BoW count vectors are still the standard input format for topic models.
Full-text search has been the canonical BoW application since the SMART system. Modern search engines such as Apache Lucene (and the Elasticsearch, OpenSearch, and Solr products built on top of it) maintain inverted indexes that map each vocabulary term to the list of documents containing it. At query time the engine retrieves the posting lists for query terms, scores each candidate document with BM25 or a TF-IDF variant, and returns the top-k by score. A 2019 study by Yang et al. showed that BM25 over standard TREC collections is still competitive with neural rerankers for many query types, especially when the queries are short and keyword-driven.[13]
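An inverted index is essentially the document-term matrix stored column by column. The sketch below builds one over a hypothetical three-document collection and retrieves the candidate set for a query; scoring the candidates (for instance with BM25) would be a separate step, and production engines add compression, skip lists, and on-disk segments that are not modeled here.

```python
from collections import defaultdict

docs = {
    0: "deep learning for search",
    1: "classical search with inverted indexes",
    2: "deep sea fishing",
}  # hypothetical collection

# Build the index: term -> {doc_id: term frequency}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term in text.split():
        index[term][doc_id] = index[term].get(doc_id, 0) + 1

def candidates(query):
    """Union of the posting lists for the query terms."""
    found = set()
    for term in query.split():
        found.update(index.get(term, {}))
    return sorted(found)

print(dict(index["search"]))      # {0: 1, 1: 1}
print(candidates("deep search"))  # [0, 1, 2]
```

Only documents that contain at least one query term are ever scored, which is why keyword search scales to very large collections.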
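The bag of words concept has been adapted for computer vision under the name "bag of visual words" (BoVW). Instead of counting text words, BoVW counts visual features extracted from images. The process follows three steps: extract local feature descriptors (such as SIFT keypoints) from each image, cluster the descriptors (typically with k-means) into a codebook of "visual words," and represent each image as a histogram of how often each visual word occurs.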
The seminal paper introducing BoVW for image retrieval was Sivic and Zisserman's 2003 "Video Google" system, which adapted text retrieval techniques to retrieve object instances in video frames.[14] Csurka et al.'s 2004 paper "Visual Categorization with Bags of Keypoints" applied the same idea to image classification with SIFT keypoints and a k-means codebook.[15] BoVW was one of the most successful methods for image classification and content-based image retrieval before the rise of convolutional neural networks. Alternatives like Vector of Locally Aggregated Descriptors (VLAD) and Fisher Vectors improved upon BoVW by encoding higher-order statistics about the feature distribution.[15] BoVW lingers in classical computer-vision pipelines for visual SLAM (where loop-closure detection often runs on bags of binary descriptors) and large-scale image retrieval where memory and inference cost dominate.
The bag of words model has several well-known shortcomings that limit its effectiveness for more complex language understanding tasks.
Because BoW treats text as an unordered collection of words, it cannot distinguish between sentences with different meanings that use the same words. "The dog bit the man" and "The man bit the dog" produce identical BoW vectors. Negation is another problem: "good" and "not good" are near-opposites, but a BoW model sees them as similar since they share the word "good."[16] Bigrams partially address this by introducing tokens like "not_good," but the fundamental issue persists for any dependency that spans more than two adjacent tokens.
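The word-order example is easy to verify. The sketch below uses simple whitespace tokenization with lowercasing as an assumed preprocessing step.

```python
from collections import Counter

def bow(text):
    """Unigram counts."""
    return Counter(text.lower().split())

def bigram_bow(text):
    """Counts of adjacent token pairs."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

a = "The dog bit the man"
b = "The man bit the dog"

print(bow(a) == bow(b))                # True: unigram counts are identical
print(bigram_bow(a) == bigram_bow(b))  # False: bigrams record local order
```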
The vocabulary size directly determines the vector length. For a realistic corpus, vocabularies can easily reach 50,000 to 100,000 words or more. Each document becomes a vector of that length, which can slow down model training and increase memory usage. This problem, known as the "curse of dimensionality," can also degrade classifier performance when the number of features far exceeds the number of training samples.
Since any individual document uses only a small fraction of the total vocabulary, the resulting vectors are extremely sparse (mostly zeros). While sparse matrix formats mitigate the storage problem, the sparsity itself can reduce the effectiveness of distance-based algorithms that rely on dense representations. Methods like latent semantic analysis (LSA), which applies truncated singular value decomposition to the BoW matrix, were originally designed to address this sparsity by projecting documents into a denser, lower-dimensional latent space.
BoW treats every word as an independent, orthogonal dimension with no notion of similarity. The words "happy" and "joyful" are as different as "happy" and "bicycle" in a BoW representation. The model cannot capture synonymy, polysemy, or any other semantic relationship between words.
At inference time, any token that did not appear in the training vocabulary is silently dropped. For long-tail or evolving domains (product names, hashtags, slang), the BoW pipeline therefore degrades over time without an obvious failure signal. Hashing vectorizers sidestep this problem by mapping any token (seen or unseen) to a bucket, but at the cost of collisions and zero interpretability.
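The silent-drop behavior can be made visible with a few lines of code. In this sketch the vocabulary, the helper name vectorize, and the example query are all hypothetical.

```python
from collections import Counter

vocabulary = ["cheap", "phone", "battery"]  # fixed at training time (assumed)

def vectorize(text):
    """Return the count vector plus the tokens that fell outside the vocabulary."""
    counts = Counter(text.lower().split())
    vector = [counts[w] for w in vocabulary]
    dropped = sorted(t for t in counts if t not in vocabulary)
    return vector, dropped

vector, dropped = vectorize("cheap foldable phone with magsafe")
print(vector)   # [1, 1, 0]
print(dropped)  # ['foldable', 'magsafe', 'with']
```

One possible mitigation is to track the share of dropped tokens over time, which provides the failure signal that the vector alone does not.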
Results can swing several percentage points based on tokenization, stopword list, and stemmer choices. There is no universal best preprocessing pipeline. Practitioners typically grid-search over preprocessing combinations using cross-validation, which is computationally cheap because BoW vectorization is fast.
Modern word embedding methods like Word2Vec, GloVe, fastText, and contextual embeddings from transformer models address many of the limitations of BoW. The table below summarizes the key differences.
| Property | Bag of Words | Word Embeddings |
|---|---|---|
| Vector type | Sparse, high-dimensional | Dense, low-dimensional (50-300 dims typical) |
| Semantic similarity | Not captured | Captured (similar words have similar vectors) |
| Word order | Completely ignored | Partially captured (contextual embeddings fully capture it) |
| Vocabulary dependence | Fixed vocabulary, out-of-vocabulary words are lost | Subword methods handle unseen words |
| Interpretability | High (each dimension corresponds to a known word) | Low (dimensions lack direct interpretation) |
| Computational cost for representation | Low (simple counting) | Higher (requires pre-trained model) |
| Training data requirement | Works with small datasets | Pre-trained models need large corpora |
Word2Vec, introduced by Mikolov et al. in 2013, learns dense word vectors from local context windows using either a skip-gram or continuous bag of words objective.[17] Pennington, Socher, and Manning's 2014 GloVe algorithm trains on global co-occurrence statistics from the corpus.[18] Bojanowski et al.'s 2016 fastText adds character n-gram averaging on top of word2vec.[19] Devlin et al.'s 2018 BERT and the broader transformer family produce contextual embeddings: the same word receives different vectors in different sentences, breaking the static-token assumption shared by BoW and word2vec alike.[20]
Bag of words remains a strong choice when interpretability matters, training data is limited, or the task is simple enough that word order and semantics are less important. For tasks requiring deeper language understanding, embeddings and neural networks that consume ordered token sequences (for example, one-hot encoded tokens) are generally preferred.[21]
Word2Vec's CBOW (continuous bag of words) variant predicts a target word from the average of its context word vectors. The name reuses "bag of words" because the context vectors are averaged without regard for order. Although Word2Vec is a dense embedding model, one of its two training objectives is built on a literal bag-of-words view of local context. The bag idea outlived its sparse representation.
In 2026, few production systems rely on BoW alone. Yet BoW remains ubiquitous as a baseline and as a feature pipeline. Three patterns recur:
Several models elaborate on the BoW idea by preserving the bag and changing the weighting or factorization:
A short field guide for engineers running BoW in production:
min_df and max_features.
Imagine you have a big box of toy blocks. Each block has a word written on it. When you read a story, you grab a block for every word in that story and put it in a bag. You do not care about the order the words appeared in. You just count how many times each word showed up. A story about cats might have three "cat" blocks, two "fish" blocks, and one "sleep" block. A story about dogs might have four "dog" blocks and one "park" block. By looking at what is in each bag, a computer can tell the two stories are about different things, even though it never read them like you would. The bag is also mostly empty: it is sized to hold any of 50,000 words but a typical story only uses 200 of them. That is what computer scientists mean when they say BoW vectors are sparse and high-dimensional.
TF-IDF, BM25, Word2Vec, GloVe, fastText, BERT, Latent Dirichlet Allocation, latent semantic analysis, naive Bayes, scikit-learn, n-gram, stemming, lemmatization, information retrieval.