A word embedding is a learned representation of text in which words are mapped to dense vectors of real numbers in a continuous vector space. Unlike sparse, high-dimensional representations such as one-hot encoding, word embeddings capture semantic and syntactic relationships between words by positioning similar words close together in the embedding space. Word embeddings are foundational to modern natural language processing (NLP) and have become a standard input representation for tasks ranging from text classification to machine translation.
Imagine you have a giant box of LEGO bricks, and each brick represents a word. Some bricks are really similar (like "happy" and "glad"), so you put them close together on a shelf. Other bricks are very different (like "happy" and "volcano"), so they go far apart. A word embedding is like a map that tells the computer where to put each word-brick on the shelf. The computer reads millions of sentences and figures out which words show up in the same kinds of sentences. Words that keep appearing near the same neighbors get placed close together on the map. That way, when the computer sees a new sentence, it already knows which words are related, just by looking at where they sit on the map.
Traditional approaches to representing words in a computer use one-hot encoding, where each word in a vocabulary of size V is represented as a vector of length V with a single 1 and all other entries set to 0. This scheme has three major problems:

- Dimensionality: the vector length grows linearly with the vocabulary, so a 100,000-word vocabulary requires 100,000-dimensional vectors that are almost entirely zeros.
- No similarity structure: every pair of distinct one-hot vectors is orthogonal, so "happy" is exactly as far from "glad" as it is from "volcano."
- No generalization: because the representations share no structure, nothing a model learns about one word transfers to related words.
Word embeddings address all three issues. They compress words into low-dimensional vectors (typically 50 to 300 dimensions) and arrange words so that semantically related terms occupy nearby regions in the vector space.
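A toy comparison makes the difference concrete. The dense vectors below are hand-picked for illustration only; real embeddings are learned from data:

```python
import numpy as np

vocab = ["happy", "glad", "volcano"]

# One-hot: a V-dimensional vector per word with a single 1.
# Every pair of distinct words is orthogonal, so no similarity is captured.
one_hot = np.eye(len(vocab))

# Toy dense vectors, invented for illustration; real embeddings are learned.
dense = {
    "happy":   np.array([0.8, 0.1, 0.3]),
    "glad":    np.array([0.7, 0.2, 0.4]),
    "volcano": np.array([-0.5, 0.9, -0.1]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(one_hot[0], one_hot[1]))           # 0.0 -- "happy" vs "glad"
print(cosine(dense["happy"], dense["glad"]))    # high -- near-synonyms
print(cosine(dense["happy"], dense["volcano"])) # low  -- unrelated words
```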
The theoretical foundation for word embeddings comes from the distributional hypothesis, which states that words appearing in similar contexts tend to have similar meanings. This idea traces back to linguist J.R. Firth, who wrote in 1957: "You shall know a word by the company it keeps." The distributional hypothesis motivates all major embedding algorithms: if two words frequently co-occur with the same neighbors, their vector representations should be close together.
Before neural word embeddings, researchers explored distributional representations through matrix factorization methods. Latent Semantic Analysis (LSA), introduced by Deerwester et al. in 1990, applies singular value decomposition (SVD) to a term-document matrix to produce low-dimensional word and document representations. LSA captures latent semantic structure, but it relies on global co-occurrence statistics and does not model word order or local context well.
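A rough sketch of the LSA pipeline using scikit-learn's TruncatedSVD on a tiny invented corpus (real applications use thousands of documents and a TF-IDF weighting):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stocks fell as markets closed",
    "investors sold stocks and bonds",
]

# Build the term-document counts (rows are documents, columns are terms).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Truncated SVD keeps the top-k latent dimensions, as LSA does.
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = svd.fit_transform(X)    # one 2-d vector per document
term_vectors = svd.components_.T      # one 2-d vector per vocabulary term
```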
Bengio et al. (2003) proposed one of the first neural network-based language model architectures that learned continuous word representations as a byproduct of predicting the next word in a sequence. This work showed that jointly learning word vectors and a language model could capture syntactic and semantic regularities, but training was computationally expensive for large vocabularies.
The most influential breakthrough in word embeddings came with Word2Vec, introduced by Tomas Mikolov and colleagues at Google in 2013. Word2Vec demonstrated that simple, shallow neural networks trained on large corpora could produce high-quality word vectors far more efficiently than previous methods.
Word2Vec offers two architectures:
CBOW predicts a target word from its surrounding context words. Given a context window (for example, four words before and after the target), the model averages the embedding vectors of the context words and passes the result through a softmax layer to predict the target word. CBOW is faster to train and works well for frequent words.
Skip-gram reverses the prediction: given a target word, it predicts the surrounding context words. This architecture tends to perform better with smaller training corpora and produces stronger representations for rare words, because each training example generates multiple context-word predictions.
Computing the full softmax over a large vocabulary is prohibitively expensive. Mikolov et al. introduced negative sampling to make training practical. Instead of updating weights for every word in the vocabulary, the model updates only the target word and a small number of randomly sampled "negative" words (words not present in the context). This reduces each training step from an operation over the entire vocabulary to an operation over a handful of words, making large-scale training feasible.
| Property | CBOW | Skip-gram |
|---|---|---|
| Prediction direction | Context predicts target | Target predicts context |
| Training speed | Faster | Slower |
| Rare word performance | Weaker | Stronger |
| Best suited for | Large corpora, frequent words | Smaller corpora, rare words |
| Typical use case | General-purpose embeddings | Specialized or technical vocabularies |
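A minimal training sketch using the gensim library, which implements both architectures (the toy corpus and hyperparameters here are illustrative, not recommendations):

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chased", "the", "cat"],
]

# sg=0 selects CBOW, sg=1 selects skip-gram; negative=5 draws five
# negative samples per positive pair instead of computing a softmax
# over the whole vocabulary.
model = Word2Vec(
    sentences,
    vector_size=100,
    window=4,
    sg=1,
    negative=5,
    min_count=1,
    epochs=50,
)

vec = model.wv["king"]                        # a 100-dimensional numpy array
print(model.wv.most_similar("king", topn=3))  # nearest neighbors in the space
```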
GloVe (Global Vectors for Word Representation), developed by Pennington, Socher, and Manning at Stanford in 2014, takes a different approach. Rather than training a neural network on local context windows, GloVe constructs a global word-word co-occurrence matrix from the entire corpus and then factorizes it to produce word vectors.
The key insight behind GloVe is that the ratio of co-occurrence probabilities between two words can encode meaning. For instance, the ratio of how often "ice" and "steam" each co-occur with "solid" versus "gas" reveals their relationship to physical states. GloVe trains a log-bilinear regression model on the nonzero entries of the co-occurrence matrix, combining the strengths of global matrix factorization (like LSA) and local context-window methods (like Word2Vec).
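Concretely, GloVe minimizes a weighted least-squares objective over the nonzero entries X_ij of the co-occurrence matrix:

J = Σ_ij f(X_ij) × (w_i · w̃_j + b_i + b̃_j − log X_ij)²

where w_i and w̃_j are the word and context vectors, b_i and b̃_j are scalar biases, and f is a weighting function that limits the influence of very frequent co-occurrences.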
GloVe achieved 75% accuracy on the Google word analogy benchmark and has been widely adopted, with pre-trained vectors available in 50, 100, 200, and 300 dimensions trained on corpora ranging from Wikipedia to Common Crawl.
FastText, introduced by Bojanowski et al. at Facebook AI Research in 2017, extends Word2Vec by representing each word as a bag of character n-grams. For example, the word "running" (with boundary markers) is decomposed into substrings like "<ru", "run", "unn", "nni", "nin", "ing", "ng>", and the word's embedding is computed as the sum of its n-gram embeddings.
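FastText itself hashes n-grams into a fixed number of buckets rather than storing each one explicitly; the helper below only enumerates the n-grams for illustration:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Enumerate character n-grams with FastText-style boundary markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams

print(char_ngrams("running", n_min=3, n_max=3))
# ['<ru', 'run', 'unn', 'nni', 'nin', 'ing', 'ng>']
```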
This subword approach provides three important advantages:

- Out-of-vocabulary words can still be assigned vectors by summing the embeddings of their character n-grams, even if the full word never appeared during training.
- Morphologically rich languages benefit, because inflected and derived forms of a word share most of their n-grams.
- Rare words receive better representations, since their n-grams also occur inside more frequent words.
FastText supports 157 languages and provides pre-trained word vectors for each, making it especially valuable for low-resource languages.
| Method | Year | Key idea | Training approach | OOV handling | Typical dimensions |
|---|---|---|---|---|---|
| LSA | 1990 | SVD on term-document matrix | Matrix factorization | None | 100-300 |
| Word2Vec | 2013 | Predict words from context | Shallow neural network | None | 100-300 |
| GloVe | 2014 | Co-occurrence probability ratios | Log-bilinear regression | None | 50-300 |
| FastText | 2017 | Subword character n-grams | Shallow neural network | Yes, via subwords | 100-300 |
Well-trained word embeddings exhibit several notable properties that have made them central to NLP research.
One of the most famous findings is that word vectors support meaningful arithmetic operations. The classic example is:
vector("king") - vector("man") + vector("woman") ≈ vector("queen")
This suggests that the embedding space encodes relational structure. The vector offset between "king" and "man" captures the concept of royalty independent of gender, and adding that offset to "woman" produces a vector close to "queen." Other examples include:

- vector("Paris") - vector("France") + vector("Italy") ≈ vector("Rome")
- vector("walked") - vector("walk") + vector("swim") ≈ vector("swam")
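The classic analogy can be reproduced with pre-trained GloVe vectors via gensim's downloader (the exact neighbors and scores depend on which vectors are loaded):

```python
import gensim.downloader as api

# Downloads pre-trained 100-dimensional GloVe vectors on first use.
wv = api.load("glove-wiki-gigaword-100")

# king - man + woman ≈ ?
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# [('queen', ...)]
```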
Words with related meanings naturally form clusters in the embedding space. Plotting word vectors using dimensionality reduction techniques like t-SNE or PCA reveals that countries group together, animals group together, and professions group together, even though no category labels were provided during training.
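A plotting sketch along these lines, reusing the `wv` vectors loaded in the previous snippet (the word list is chosen arbitrarily):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

words = ["france", "italy", "spain",
         "dog", "cat", "horse",
         "doctor", "lawyer", "teacher"]
X = np.array([wv[w] for w in words])

# Project 100-d vectors to 2-d; perplexity must be smaller than len(words).
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y))
plt.show()
```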
The standard measure for comparing word embeddings is cosine similarity, defined as:
cos(A, B) = (A · B) / (||A|| × ||B||)
Cosine similarity ranges from -1 (opposite) to 1 (identical direction), with 0 indicating orthogonality. It is preferred over Euclidean distance for word vectors because it measures the angle between vectors rather than their magnitude, making it invariant to vector length.
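With the same `wv` object loaded earlier, gensim exposes this measure directly:

```python
# KeyedVectors.similarity implements exactly the cosine formula above.
print(wv.similarity("cat", "dog"))        # related words: high similarity
print(wv.similarity("cat", "economics"))  # unrelated words: near zero
```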
The dimensionality of word embeddings involves a trade-off between expressiveness and efficiency. Common choices range from 50 to 300 dimensions.
| Dimension range | Characteristics | Best suited for |
|---|---|---|
| 50-100 | Compact, fast, lower memory | Information retrieval, simple classification, visualization |
| 150-200 | Balanced performance | General-purpose NLP tasks |
| 200-300 | More nuanced semantic capture | Sentiment analysis, machine translation, analogy tasks |
| 300+ | Diminishing returns for static embeddings | Specialized research applications |
A common rule of thumb sets the dimension to approximately the fourth root of the vocabulary size (V^(1/4)); for example, a one-million-word vocabulary gives 1,000,000^(1/4) ≈ 32 dimensions. Empirical tuning on a validation set remains the most reliable approach, however: increasing dimensions beyond a certain point leads to diminishing returns and can cause overfitting, particularly with limited training data.
Training word embeddings from scratch requires large corpora and substantial computation. In practice, researchers and engineers frequently use pre-trained embeddings as a starting point. Popular pre-trained options include:

- Word2Vec vectors trained on Google News (300 dimensions, roughly 3 million words and phrases)
- GloVe vectors trained on Wikipedia, Gigaword, and Common Crawl (50 to 300 dimensions)
- FastText vectors for 157 languages trained on Common Crawl and Wikipedia
Pre-trained embeddings can be used in two ways. In feature extraction, the embeddings are frozen and used as fixed input features for a downstream model. In fine-tuning, the embeddings are initialized with pre-trained values but updated during training on a task-specific dataset. Fine-tuning typically yields better performance when sufficient task-specific data is available, while feature extraction is preferred when data is limited and there is a risk of overfitting.
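A sketch of both modes in PyTorch, with random weights standing in for real pre-trained vectors:

```python
import torch
import torch.nn as nn

# Stand-in for a (vocab_size, dim) matrix of real pre-trained vectors,
# e.g. copied out of a gensim KeyedVectors object.
weights = torch.randn(10_000, 300)

# Feature extraction: the embedding table stays fixed during training.
frozen = nn.Embedding.from_pretrained(weights, freeze=True)

# Fine-tuning: the table is updated along with the rest of the model.
tunable = nn.Embedding.from_pretrained(weights, freeze=False)

token_ids = torch.tensor([[1, 42, 7]])
features = frozen(token_ids)  # shape: (1, 3, 300)
```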
Static word embeddings (Word2Vec, GloVe, FastText) assign a single vector to each word regardless of context. This means the word "bank" receives the same representation whether it appears in "river bank" or "bank account." Contextualized embeddings solve this problem by generating different representations for the same word depending on its surrounding context.
ELMo (Embeddings from Language Models), introduced by Peters et al. in 2018, generates contextualized word representations using a deep bidirectional language model (biLM). The model consists of two-layer bidirectional LSTMs trained on a large text corpus. A key innovation of ELMo is that it combines representations from all layers of the biLM, not just the top layer. The lower layers tend to capture syntactic information, while the higher layers capture semantic information. Adding ELMo representations to existing models improved performance across six major NLP benchmarks, including question answering, textual entailment, and sentiment analysis.
BERT (Bidirectional Encoder Representations from Transformers), introduced by Devlin et al. in 2018, uses the transformer architecture to produce contextualized embeddings. Unlike ELMo's sequential LSTM approach, BERT processes all tokens in a sentence simultaneously through self-attention mechanisms, allowing each token's representation to be influenced by every other token. BERT is pre-trained on masked language modeling (predicting randomly masked tokens) and next sentence prediction, then fine-tuned for downstream tasks.
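A sketch of extracting contextual vectors for "bank" with the Hugging Face transformers library (the two example sentences are invented):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Return BERT's contextual vector for the token 'bank'."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

v1 = bank_vector("she sat on the river bank")
v2 = bank_vector("he deposited cash at the bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # noticeably below 1.0
```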
| Feature | Static embeddings | Contextualized embeddings |
|---|---|---|
| Representation | One vector per word | Different vector per word occurrence |
| Polysemy handling | Cannot distinguish senses | Produces sense-specific vectors |
| Model complexity | Shallow (1-2 layers) | Deep (12-24+ layers) |
| Typical dimensions | 50-300 | 768-1,024 |
| Computational cost | Low | High |
| Training data needed | Moderate | Large |
| Examples | Word2Vec, GloVe, FastText | ELMo, BERT, GPT |
Word embeddings represent individual words, but many applications require representations of entire sentences, paragraphs, or documents.
The simplest approach averages the word embeddings across all words in a sentence or document. While this loses word order information, it often provides a surprisingly strong baseline for tasks like document similarity.
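A minimal averaging baseline, assuming a gensim `wv` object as in the earlier snippets:

```python
import numpy as np

def average_embedding(tokens, wv):
    """Mean of the vectors for all in-vocabulary tokens."""
    vectors = [wv[t] for t in tokens if t in wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(wv.vector_size)

doc_vector = average_embedding("the cat sat on the mat".split(), wv)
```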
Doc2Vec (Paragraph Vectors), introduced by Le and Mikolov in 2014, extends Word2Vec by adding a document-level vector that is trained alongside word vectors. It comes in two variants: Distributed Memory (PV-DM), which is analogous to CBOW with an additional paragraph vector, and Distributed Bag-of-Words (PV-DBOW), which is analogous to Skip-gram. Doc2Vec produces fixed-length representations for variable-length text.
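A gensim sketch with a tiny invented corpus (real training needs far more data):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument(words=["stocks", "fell", "sharply"], tags=[0]),
    TaggedDocument(words=["the", "cat", "sat", "down"], tags=[1]),
]

# dm=1 selects PV-DM (CBOW-like); dm=0 selects PV-DBOW (skip-gram-like).
model = Doc2Vec(corpus, vector_size=50, dm=1, min_count=1, epochs=40)

# Infer a fixed-length vector for an unseen document.
new_vec = model.infer_vector(["markets", "fell", "today"])
```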
Sentence-BERT (Reimers and Gurevych, 2019) fine-tunes BERT using siamese and triplet network structures to produce semantically meaningful sentence embeddings. Unlike extracting embeddings from BERT's [CLS] token (which performs poorly for similarity tasks without fine-tuning), Sentence-BERT is specifically optimized for comparing sentences using cosine similarity.
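Usage via the sentence-transformers library is deliberately simple (the example sentences are invented):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional output
sentences = ["A man is playing guitar.", "Someone is strumming a guitar."]
embeddings = model.encode(sentences)

print(util.cos_sim(embeddings[0], embeddings[1]))  # high similarity
```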
The embedding landscape has evolved considerably since Word2Vec. Modern embedding models are typically based on transformer architectures and produce contextualized, high-dimensional representations.
| Model | Provider | Dimensions | Key features |
|---|---|---|---|
| text-embedding-ada-002 | OpenAI | 1,536 | General-purpose, unified model replacing earlier specialized models |
| text-embedding-3-small | OpenAI | 1,536 | 5x cheaper than ada-002, with comparable quality |
| text-embedding-3-large | OpenAI | 3,072 | Highest quality OpenAI model, supports dimension reduction |
| E5-large-v2 | Microsoft | 1,024 | Open-source, strong on retrieval benchmarks |
| all-MiniLM-L6-v2 | Sentence-Transformers | 384 | Lightweight, fast, good for clustering and semantic search |
| BGE-large-en-v1.5 | BAAI | 1,024 | Open-source, top-performing on MTEB benchmark |
| Cohere embed-v3 | Cohere | 1,024 | Supports 100+ languages, search and classification optimized |
OpenAI's newer text-embedding-3-large model achieves an average MTEB score of 64.6%, compared to 61.0% for text-embedding-ada-002. A notable feature of the newer models is native support for Matryoshka representation learning, which allows developers to truncate embeddings to smaller dimensions with minimal quality loss, trading off accuracy for reduced storage and computation.
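A sketch of Matryoshka-style truncation, assuming the embedding was trained so that the leading dimensions carry the most information:

```python
import numpy as np

def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and re-normalize to unit length."""
    shortened = np.asarray(vec[:dim], dtype=np.float32)
    return shortened / np.linalg.norm(shortened)

full = np.random.randn(3072).astype(np.float32)  # stand-in for a real embedding
small = truncate_embedding(full, 256)            # 256-d, still unit-length
```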
Word embeddings learn from human-generated text and consequently absorb the biases present in that text. Bolukbasi et al. (2016) demonstrated that Word2Vec embeddings trained on Google News articles encode gender stereotypes. For example, the analogy "man is to computer programmer as woman is to homemaker" emerged from the learned vector relationships. Other studies have documented racial, religious, and age-related biases in standard embedding models.
Several debiasing techniques have been proposed:

- Hard debiasing (Bolukbasi et al., 2016), which projects gender-neutral words off an identified gender direction and equalizes pairs such as "he"/"she"
- Counterfactual data augmentation, which retrains embeddings on corpora in which gendered terms have been systematically swapped
- Adversarial debiasing, which trains representations so that a discriminator cannot recover the protected attribute from them
Despite these techniques, fully eliminating bias from embeddings remains an open research problem. Downstream applications using word embeddings should be audited for fairness, particularly in high-stakes domains like hiring, lending, and criminal justice.
Embedding quality is typically assessed through intrinsic and extrinsic evaluation methods.
Intrinsic evaluation directly tests the properties of the embedding space without reference to a downstream application.
| Benchmark | Task | Description |
|---|---|---|
| Google Analogy | Word analogy | 19,544 analogy questions covering semantic and syntactic relations |
| BATS | Word analogy | 99,200 questions across 4 relation categories |
| WordSim-353 | Word similarity | 353 word pairs rated by human judges for similarity |
| SimLex-999 | Word similarity | 999 pairs specifically testing genuine similarity (not relatedness) |
| MEN | Word relatedness | 3,000 word pairs rated for relatedness |
Extrinsic evaluation measures how well embeddings improve performance on a real downstream task. Common tasks include:

- Text classification and sentiment analysis
- Named entity recognition
- Part-of-speech tagging
- Machine translation
- Question answering
Intrinsic and extrinsic evaluations do not always correlate perfectly. Embeddings that score highest on word analogy tasks may not produce the best downstream performance, so practitioners typically evaluate on the specific task of interest.
Word embeddings are used across a wide range of applications in NLP and beyond, including semantic search and information retrieval, document classification and clustering, machine translation, question answering, and recommendation systems.
Despite their usefulness, word embeddings have several known limitations:

- Static embeddings assign one vector per word and cannot disambiguate polysemous words such as "bank."
- Word2Vec and GloVe cannot produce vectors for out-of-vocabulary words; FastText mitigates this through subwords.
- Embeddings absorb social biases present in their training corpora, as discussed above.
- High-quality embeddings require large training corpora, which may not exist for specialized domains or low-resource languages.
- A single vector per word discards word order and compositional structure, which contextualized and sentence-level models were developed to address.