word2vec
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v7 · 6,697 words
word2vec is a family of shallow neural network models for learning dense vector representations of words from large unlabeled text corpora. It was developed by Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean at Google and introduced in two influential 2013 papers: "Efficient Estimation of Word Representations in Vector Space" (arXiv:1301.3781, January 2013), which proposed the two core architectures Continuous Bag-of-Words (CBOW) and Skip-gram[1]; and "Distributed Representations of Words and Phrases and their Compositionality" (NIPS 2013, arXiv:1310.4546, October 2013), which added Ilya Sutskever as a co-author and introduced the negative sampling and subsampling techniques that made very large scale training practical[2]. The second paper received the NeurIPS Test of Time Award in 2023[3].
word2vec is the foundational neural word embedding method. It demonstrated that very simple log-bilinear architectures, trained on billions of words, could produce vectors whose geometry encodes both semantic and syntactic relationships. Its most celebrated property is vector arithmetic over meanings, with the now-canonical relation vec("king") - vec("man") + vec("woman") returning a vector whose nearest neighbor is vec("queen")[1][2]. The release of Google's open-source C implementation and the pre-trained GoogleNews-vectors-negative300 model[4] made high-quality embeddings freely available to the research community, and the method became one of the most widely deployed building blocks in natural language processing (NLP) for the next several years. While word2vec has largely been superseded for state-of-the-art NLP by contextual embeddings such as ELMo (2018)[5] and BERT (2018)[6], it remains a workhorse for information retrieval, recommendation, and classical NLP pipelines, and is still a standard teaching tool[7].
Before word2vec, NLP systems most commonly represented words as sparse, high-dimensional vectors. One-hot encoding assigns each word a vector of length equal to the vocabulary size with a single 1 and the rest zeros; under this scheme all distinct words are equidistant, so the representation carries no information about meaning. Term-weighting schemes such as TF-IDF augment a sparse vector with corpus statistics but still treat distinct word types as orthogonal.
Distributional methods improve on this by extracting structure from word co-occurrence. Latent Semantic Analysis (LSA) builds a term-document matrix and reduces it via singular value decomposition; Hyperspace Analogue to Language (HAL) and similar methods factorize term-term co-occurrence matrices. These count-based approaches operationalize the distributional hypothesis, Firth's 1957 observation that "you shall know a word by the company it keeps," but they require materializing very large matrices and the resulting dimensions do not, in general, align cleanly with interpretable linguistic relations.
A parallel line of work used neural networks. Bengio et al. (2003) proposed a neural probabilistic language model whose hidden layer doubled as a learned distributed word representation, simultaneously learning a continuous representation for each word and a probability function for word sequences expressed in terms of those representations[8]. The Bengio approach fights the curse of dimensionality by tying generalization to similarity in vector space rather than to surface n-gram overlap. Collobert and Weston (2008, 2011) extended this idea with a single convolutional architecture trained jointly on part-of-speech tagging, chunking, named entity recognition, and semantic role labeling using a shared embedding lookup table; their multi-task system showed that generic embeddings pre-trained on unlabeled text could match or beat heavily engineered task-specific baselines[9]. Andriy Mnih and Geoffrey Hinton's hierarchical log-bilinear language model and Turian et al.'s comparative study of pre-trained representations across NLP tasks were also part of the immediate background. However, these neural approaches were computationally expensive: their hidden nonlinearities and full softmax outputs scaled poorly, restricting them to vocabularies and corpora orders of magnitude smaller than what was readily available on the web.
Mikolov's contribution was to strip the architecture down to the minimum necessary to learn high-quality vectors, and to combine that simpler model with engineering tricks that drove the training cost low enough to use corpora of billions of tokens, turning word embedding from a research curiosity into a commodity[1]. The model has no hidden non-linearity at all: it is a pure log-bilinear projection, with all of the representational capacity concentrated in two embedding matrices of size V x d. By eliminating the per-step matrix multiplication through a hidden layer of hundreds of units, training throughput rose by orders of magnitude.
The word2vec project was led by Tomas Mikolov, then a researcher in the Brain team at Google in Mountain View. Mikolov had earned his PhD at Brno University of Technology in the Czech Republic with a thesis on recurrent neural network language models, work that produced the original RNNLM toolkit[10]. Before joining Google he had also worked at Microsoft Research. His Google co-authors were Kai Chen, Greg Corrado, and Jeff Dean on the January 2013 paper, with Ilya Sutskever added on the October 2013 paper.
Mikolov moved to Facebook AI Research in 2014, where he continued working on word embeddings, co-authoring FastText[11] and the StarSpace generalization[12], and on language modeling more broadly. In March 2020 he returned to the Czech Republic as a Senior Research Scientist at the Czech Institute of Informatics, Robotics and Cybernetics (CIIRC) at Czech Technical University in Prague, where he now leads a research group on foundational language models and complex systems[10][13].
The reception of the first paper was not initially smooth. The January 2013 manuscript was submitted to the inaugural ICLR conference and rejected, drawing "strong reject" reviews from multiple reviewers; one objection was that the model discards word order, another sought to compel additional citations[14]. The paper was admitted to the ICLR 2013 workshop track that May in Scottsdale, Arizona, and the open-source release of Google's C code was reportedly delayed by months while Mikolov navigated Google's internal approval process[7][14]. Mikolov has publicly noted that word2vec is now probably more cited than every paper accepted to the main ICLR 2013 track combined[14].
The pair of word2vec papers attracted citations almost immediately. By the time of the NeurIPS 2023 Test of Time Award, the 2013 NeurIPS paper alone had accumulated more than 40,000 citations, and the prize was accepted in person by Jeffrey Dean and Greg Corrado on behalf of all five authors[3][15]. The NeurIPS citation framed the work as having "catalyzed progress that marked the beginning of a new era in natural language processing" by "demonstrating the power of learning from large amounts of unstructured text"[3].
word2vec is associated with two papers from 2013 that contributed different pieces of the now-standard recipe.
The first paper, "Efficient Estimation of Word Representations in Vector Space," was posted to arXiv on 16 January 2013 (arXiv:1301.3781) and presented at the ICLR 2013 workshop track in Scottsdale, Arizona that May[1]. Its central contributions are the CBOW and Skip-gram architectures and an empirical demonstration that, on a new analogy benchmark the authors released, these simple log-bilinear models trained on 6 billion tokens outperformed neural network language models trained on far less data and at far higher cost. The paper introduced the now-famous syntactic and semantic analogy task and reported that, on the same 783 million word training set used for prior NNLM and RNNLM baselines, 300-dimensional Skip-gram reached 50.0% semantic and 55.9% syntactic accuracy where the contemporaneous RNNLM reached only 8.6% semantic[1]. Scaling Skip-gram to 1000 dimensions and 6 billion tokens pushed the overall total to 65.6%[1].
The second paper, "Distributed Representations of Words and Phrases and their Compositionality," was posted to arXiv on 16 October 2013 (arXiv:1310.4546) and published at NIPS 2013[2]. Its contributions are training-side optimizations and extensions: negative sampling as an alternative to hierarchical softmax, subsampling of frequent words, a procedure for learning vectors for multi-word phrases, and an analysis of additive compositionality (for example, vec("Russia") + vec("river") is close to vec("Volga River"))[2].
Both papers were submitted while all authors were at Google, and Google released the reference C implementation as open source under the URL code.google.com/p/word2vec (now hosted on the Google Code Archive)[4]. The method was also covered by a US patent assigned to Google (US 9,037,464 B1, filed January 2013, granted May 2015), an unusual situation that did not impede academic use given the simultaneous open-source release[16].
word2vec defines two architectures that share a common skeleton: an input embedding matrix W ∈ R^{V×d} mapping each of V vocabulary items to a d-dimensional vector, a single linear projection layer (no hidden nonlinearity), and an output matrix W' ∈ R^{V×d}. The training pairs are extracted by sliding a context window of width c across the corpus. The two architectures differ in which side of the (target, context) relationship is the input and which is the prediction target.
A key architectural decision is that each word holds two distinct vectors, an input vector (sometimes called the target vector) used when the word appears as the focal token and an output vector (the context vector) used when the word appears in a window around some other token. After training, the standard practice is to discard the output matrix and use only the input embeddings, but a few later analyses recommend averaging the two matrices for slightly higher quality on some tasks[17]. Vector dimensionality d is a hyperparameter; on Google News, the released embeddings use d = 300, and quality typically improves with dimensionality up to roughly 300-1000 before saturating[2][7].
Skip-gram takes the target word as input and predicts each surrounding context word independently. For a sentence w_1, w_2, ..., w_T and window radius c, the training objective is
J = (1/T) sum_{t=1..T} sum_{-c <= j <= c, j != 0} log p(w_{t+j} | w_t)
where the conditional p(w_O | w_I) is modeled with a softmax over v_{w_O}^T · v_{w_I} (with separate input/output embeddings)[1][2]. Each occurrence of a word w_t therefore produces up to 2c training pairs (w_t -> w_{t+j}), so even rare words generate many gradient updates per occurrence. In practice Skip-gram is the variant most associated with the word2vec name: it tends to produce slightly higher-quality semantic embeddings, particularly for rare words and on the analogy benchmark, at the cost of training time roughly linear in the window size[1][2]. The original C implementation also samples the window radius uniformly between 1 and c for each token, which effectively gives closer context words a higher weight without adding a learned weighting scheme[4][17].
The per-update training cost of Skip-gram is Q = C · (D + D · log2(V)) when paired with hierarchical softmax, where C is the expected window size and D the embedding dimension[1]. Replacing the inner D · log2(V) term with k · D for negative sampling yields the variant that dominates in practice.
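As a rough illustration of how Skip-gram training pairs arise, the sketch below walks over a tokenized sentence and, for each position, draws a window radius uniformly from 1 to c before emitting (center, context) pairs. The function name `skipgram_pairs`, the toy sentence, and the seeded random generator are illustrative assumptions and are not taken from the reference C code.

```python
import random


def skipgram_pairs(tokens, c, rng):
    """Return (center, context) pairs with a per-token radius drawn from 1..c."""
    pairs = []
    for t, center in enumerate(tokens):
        radius = rng.randint(1, c)  # dynamic window: uniform over {1, ..., c}
        lo = max(0, t - radius)
        hi = min(len(tokens), t + radius + 1)
        for j in range(lo, hi):
            if j != t:  # the center word is never its own context
                pairs.append((center, tokens[j]))
    return pairs


# Toy sentence (hypothetical data, for illustration only).
tokens = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(tokens, c=2, rng=random.Random(0))
for center, context in pairs[:5]:
    print(center, "->", context)
```

With c = 1 the radius is always 1, so each interior token yields two pairs and each edge token yields one; larger c produces up to 2c pairs per token, which is where the roughly linear dependence of training time on window size comes from.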
Continuous Bag-of-Words inverts the prediction: the context predicts the target. The model averages the input embeddings of the surrounding context words within the window, projects the result through the output matrix, and is trained with a softmax to assign high probability to the true center word[1]. The objective is
J = (1/T) sum_{t=1..T} log p(w_t | w_{t-c}, ..., w_{t-1}, w_{t+1}, ..., w_{t+c}).
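The CBOW forward pass described above can be sketched in a few lines: average the input vectors of the context words, score every vocabulary item against the output matrix, and normalize with a softmax. This is a minimal sketch with a full softmax over a six-word toy vocabulary; the matrices `W_in` and `W_out`, their random initialization, and the word ids are assumptions made for illustration, and a practical implementation would replace the full softmax with one of the approximations used by word2vec.

```python
import math
import random


def cbow_probs(context_ids, W_in, W_out):
    """Return p(w | context) for every vocabulary id under a full softmax."""
    d = len(W_in[0])
    # Average the input embeddings of the context words (order is discarded).
    h = [sum(W_in[i][k] for i in context_ids) / len(context_ids) for k in range(d)]
    # Score each vocabulary item with its output vector.
    scores = [sum(h[k] * row[k] for k in range(d)) for row in W_out]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


rng = random.Random(0)
V, d = 6, 4  # toy vocabulary size and embedding dimension (assumed values)
W_in = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(V)]
W_out = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(V)]

# Context ids 0, 1, 3, 4 surround the center word with id 2.
probs = cbow_probs([0, 1, 3, 4], W_in, W_out)
loss = -math.log(probs[2])  # negative log-likelihood of the true center word
print(round(loss, 4))
```

Because the context vectors are averaged, permuting the context ids leaves the prediction unchanged up to floating-point rounding.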
"Bag-of-words" reflects the architectural choice to average context vectors symmetrically, discarding the order of context positions. CBOW is faster than Skip-gram because each window produces a single prediction rather than 2c, and it tends to give slightly better representations for very frequent words, since multiple context observations are smoothed in a single update[1]. The CBOW per-update complexity is Q = N · D + D · log2(V) with hierarchical softmax, where N is the number of context words[1]. Skip-gram is the more common default in downstream applications, particularly for tasks that depend on rare-word quality.
A side-by-side comparison:
| Property | CBOW | Skip-gram |
|---|---|---|
| Input | Bag of context words | Single target word |
| Output | Single target word | Each context word |
| Training updates per occurrence | One | Up to 2c |
| Training speed | Faster | Slower |
| Frequent-word quality | Slightly better | Slightly worse |
| Rare-word and semantic quality | Slightly worse | Slightly better |
| Suggested context window | Around 5 | Around 10 |
The suggested context window sizes (around 5 for CBOW, around 10 for Skip-gram) follow the guidance published with the original C tool, whose built-in default window is 5, and reflect the asymmetry of the two objectives[4][7].
For a vocabulary of V words, frequently in the millions for web-scale corpora, the obvious softmax denominator (a sum of V exponentials) makes naive training infeasible. The word2vec papers rely on two practical alternatives, hierarchical softmax (already used in the January 2013 paper) and negative sampling (introduced in the October 2013 paper), with a third trick (subsampling) addressing a separate problem with the data distribution[1][2].
Hierarchical softmax replaces the flat output layer with a binary tree whose leaves are the vocabulary words. A logistic regression at each internal node decides whether to go left or right, and p(w | context) is the product of those binary probabilities along the unique root-to-leaf path for w. With a balanced tree the cost of one update drops from O(V) to O(log V); the original paper uses a Huffman tree keyed on word frequency, so the expected path length for a randomly drawn target is even shorter than a balanced binary tree[1][18]. For a 1-million-word vocabulary this reduces the inner loop from a million dot products to roughly 20. Each internal node carries its own learned vector that participates in the path probability, so the total parameter count for the tree is (V - 1) · d, roughly matching the parameter count of the original output matrix it replaces[18].
Hierarchical softmax tends to give better results than negative sampling for rare words and small training corpora, where the negative-sampling estimator's variance is highest[2][17]. It is also the option of choice when one needs a properly normalized distribution rather than a similarity score, since the tree induces a valid probability over the entire vocabulary.
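To make the tree construction concrete, the following sketch builds a Huffman tree over a five-word toy vocabulary and evaluates the path probability of each word as a product of binary logistic decisions. The word counts, the random node vectors, and the helper names `huffman_paths` and `hs_probability` are illustrative assumptions; the reference C code stores the tree in flat arrays rather than dictionaries.

```python
import heapq
import math
import random


def huffman_paths(counts):
    """Map each word to its root-to-leaf path as [(internal_node, bit), ...]."""
    # Heap entries are (count, tie_breaker, node); the unique tie_breaker
    # means the node objects themselves are never compared.
    heap = [(cnt, i, word) for i, (word, cnt) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    children = {}
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        node = ("node", next_id)
        children[node] = (left, right)
        heapq.heappush(heap, (c1 + c2, next_id, node))
        next_id += 1
    root = heap[0][2]
    paths, stack = {}, [(root, [])]
    while stack:
        node, path = stack.pop()
        if node in children:
            left, right = children[node]
            stack.append((left, path + [(node, 0)]))
            stack.append((right, path + [(node, 1)]))
        else:
            paths[node] = path
    return paths


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hs_probability(word, h, paths, node_vecs):
    """Multiply the left/right probabilities along the word's path."""
    p = 1.0
    for node, bit in paths[word]:
        score = sum(a * b for a, b in zip(h, node_vecs[node]))
        p *= sigmoid(score) if bit == 0 else sigmoid(-score)
    return p


counts = {"the": 50, "of": 30, "cat": 10, "sat": 6, "mat": 4}  # toy counts
paths = huffman_paths(counts)
rng = random.Random(0)
dim = 4
internal = sorted({node for path in paths.values() for node, _ in path})
node_vecs = {node: [rng.uniform(-1, 1) for _ in range(dim)] for node in internal}
h = [rng.uniform(-1, 1) for _ in range(dim)]  # stand-in for a projected input
total = sum(hs_probability(w, h, paths, node_vecs) for w in counts)
print({w: len(p) for w, p in paths.items()}, round(total, 6))
```

Frequent words end up near the root (a path of length 1 for "the" here, 4 for "mat"), there are V - 1 internal nodes, and the probabilities sum to 1 because sigma(x) + sigma(-x) = 1 at every branch.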
Negative sampling (NEG) is the better-known approach and is the default in most word2vec implementations. Instead of computing a normalized distribution over the whole vocabulary, it casts each (target, context) example as a logistic classification problem: distinguish the true context word from k noise words sampled from a noise distribution. The objective for a single positive pair (w_I, w_O) is
log sigma(v_{w_O}^T · v_{w_I}) + sum_{i=1..k} E_{w_n ~ P_n} [log sigma(-v_{w_n}^T · v_{w_I})].
Mikolov et al. recommend k = 5–20 for small training datasets and k = 2–5 for very large ones[2]. The noise distribution P_n(w) is the empirical unigram distribution U(w) raised to the 3/4 power and renormalized; this exponent was selected empirically by trying several variants and gives rare words a relatively higher chance of being picked as negatives than U(w) would, improving their embeddings[2][19]. The 3/4 exponent is also widely used as the noise distribution for descendants of word2vec including FastText, item2vec, and node2vec[20].
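The objective and the noise distribution above can be illustrated with a small sketch: it raises toy unigram counts to the 3/4 power, samples k = 5 negatives, and evaluates the negative-sampling objective for one positive pair. The counts, the random vectors, and the function names are assumptions for illustration only; unlike production implementations, this sketch does not prevent the positive word from being drawn as a negative.

```python
import math
import random


def noise_distribution(counts, power=0.75):
    """Unigram counts raised to `power` and renormalized."""
    weights = {w: c ** power for w, c in counts.items()}
    z = sum(weights.values())
    return {w: x / z for w, x in weights.items()}


def log_sigmoid(x):
    # Numerically stable log(1 / (1 + exp(-x))).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def neg_objective(v_in, v_out, negative_vecs):
    """log sigma(v_out . v_in) + sum over sampled negatives of log sigma(-v_n . v_in)."""
    value = log_sigmoid(dot(v_out, v_in))
    for v_n in negative_vecs:
        value += log_sigmoid(-dot(v_n, v_in))
    return value


counts = {"the": 1000, "of": 600, "cat": 10, "volga": 2}  # toy counts
p_noise = noise_distribution(counts)

rng = random.Random(0)
words = list(p_noise)
# Note: the positive word may be drawn here; real implementations skip it.
negatives = rng.choices(words, weights=[p_noise[w] for w in words], k=5)

dim = 4
out_vecs = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in words}
v_in = [rng.uniform(-0.5, 0.5) for _ in range(dim)]
objective = neg_objective(v_in, out_vecs["cat"], [out_vecs[w] for w in negatives])
print(p_noise, negatives, round(objective, 4))
```

In this toy vocabulary the 3/4 power lowers the sampling probability of "the" and raises that of "volga" relative to the raw unigram distribution, which is the smoothing effect described above.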
Negative sampling is a coarser approximation of softmax than noise-contrastive estimation (NCE) from which it descends, and Goldberg and Levy's companion derivation in 2014 made the relationship precise: NEG corresponds to the limit case where the partition function is treated as a constant and the noise distribution is reweighted, sacrificing strict probabilistic interpretation in exchange for very fast training[19]. Levy and Goldberg's later NIPS 2014 analysis showed that, in the limit of unlimited capacity, Skip-gram with negative sampling (often abbreviated SGNS) is implicitly factorizing a matrix of shifted PMI values (see "Theoretical understanding" below)[21].
In any natural corpus the unigram distribution is dominated by a handful of function words (the, of, a, and), each of which appears in millions of windows but provides little information about meaning. word2vec subsamples these by discarding each occurrence of a word w with probability
P(discard | w) = 1 - sqrt(t / f(w)),
where f(w) is the relative frequency of w and t is a threshold typically around 10^-5[2]. The discard probability is essentially zero for moderately rare words and approaches 1 for the most common tokens. The practical effect is a 2-10x speed-up in training, a small expansion of the effective window (since many "the"-tokens are dropped, the surviving context words are sometimes farther apart in the original text), and a measurable improvement on rare-word similarity and analogy tasks[2]. The formula keeps the relative frequency ranking intact while sharply down-weighting the very top of the unigram distribution.
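A small sketch of the discard rule, assuming the formula exactly as stated above: the probability is clamped at zero for words whose relative frequency falls below the threshold, since the raw expression would otherwise go negative. The frequencies, the clamp, and the function names are illustrative assumptions.

```python
import math
import random


def discard_prob(freq, t=1e-5):
    """P(discard | w) = 1 - sqrt(t / f(w)), clamped at 0 for words with f(w) < t."""
    return max(0.0, 1.0 - math.sqrt(t / freq))


def subsample(tokens, freqs, rng, t=1e-5):
    """Keep each token with probability 1 - P(discard | token)."""
    return [w for w in tokens if rng.random() >= discard_prob(freqs[w], t)]


# Hypothetical relative frequencies for a very common and a very rare word.
freqs = {"the": 0.05, "volga": 1e-6}
tokens = ["the"] * 1000 + ["volga"]
kept = subsample(tokens, freqs, random.Random(0))

print(round(discard_prob(0.05), 4))   # a very frequent word is almost always dropped
print(discard_prob(1e-6))             # a word below the threshold is never dropped
print(kept.count("the"), kept.count("volga"))
```

With t = 10^-5, a word that makes up 5% of the corpus is discarded about 98.6% of the time, while any word at or below the threshold is always kept.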
A separate but related trick used in the C implementation is dynamic context window sampling: for each token, the actual window radius is sampled uniformly in [1, c]. A context word at distance d is therefore included with probability (c - d + 1)/c, so nearer words are weighted more heavily and the weight decays linearly with distance. This is sometimes credited with a sizable share of word2vec's quality advantage over simpler co-occurrence baselines[17].
The clearest demonstration of word2vec's structure is the analogy task the papers used as a benchmark. Given an analogy A : B :: C : ?, the model returns
argmax_{w in V, w != A, B, C} cos( v_w, v_B - v_A + v_C ).
This 3CosAdd rule operationalizes the parallelogram model of analogy: the vector offset v_B - v_A is taken to capture the relation between A and B, and adding it to v_C is expected to land near the analogous word. When trained on enough data with Skip-gram or CBOW, these computations recover plausible answers for a wide range of relationships[1][2]:
- vec("king") - vec("man") + vec("woman") ≈ vec("queen")
- vec("Paris") - vec("France") + vec("Italy") ≈ vec("Rome")
- vec("Berlin") - vec("Germany") + vec("Japan") ≈ vec("Tokyo")
- vec("Madrid") - vec("Spain") + vec("Russia") ≈ vec("Moscow")
- vec("walking") - vec("walk") + vec("swim") ≈ vec("swimming")
- vec("cars") - vec("car") + vec("apple") ≈ vec("apples")
- vec("biggest") - vec("big") + vec("small") ≈ vec("smallest")

To benchmark this, Mikolov et al. assembled a 19,544-question analogy test set (the "Google analogy test set" distributed as questions-words.txt) covering 14 relation types: 5 semantic categories (capital-common-countries, capital-world, currency, city-in-state, family) and 9 syntactic categories (adjective-to-adverb, opposite, comparative, superlative, present-participle, nationality-adjective, past-tense, plural, plural-verbs)[22]. The test set contains 8,869 semantic and 10,675 syntactic questions, with each relation represented by 20-100 example pairs combined into all-vs-all analogies. On 6B tokens, 1000-dimensional Skip-gram achieves about 65.6% accuracy overall on this benchmark[1]. The October 2013 paper extends the analysis to phrases, showing for instance that vec("Vietnam") + vec("capital") is close to vec("Hanoi") and that vector addition behaves like a soft semantic conjunction[2].
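The 3CosAdd rule is short enough to write out directly. The sketch below applies it to a handful of hand-made three-dimensional vectors; these toy embeddings are an assumption constructed so that the gender offset is exactly parallel, and they are not trained word2vec vectors.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, emb):
    """Answer a : b :: c : ? by maximizing cos(v_w, v_b - v_a + v_c), excluding a, b, c."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))


# Hand-made toy vectors: (royalty, maleness, fruitiness). Not trained embeddings.
emb = {
    "king":   [0.9,  0.8, 0.0],
    "queen":  [0.9, -0.8, 0.0],
    "man":    [0.1,  0.8, 0.0],
    "woman":  [0.1, -0.8, 0.0],
    "prince": [0.7,  0.8, 0.1],
    "apple":  [0.0,  0.0, 1.0],
}

print(analogy("man", "king", "woman", emb))
```

On real embeddings the same function would be run over the full vocabulary, usually on length-normalized vectors so that the cosine reduces to a dot product.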
The analogy benchmark drew justified scepticism by 2016. Linzen (2016) and Rogers et al. observed that the standard evaluation procedure explicitly excludes the three input words A, B, C from the candidate set, which inflates accuracy: a substantial fraction of the apparent successes occur because the input word would otherwise have been the closest match[23]. Drozd, Gladkova, and Matsuoka in their 2016 COLING paper "Word Embeddings, Analogies, and Machine Learning: Beyond king - man + woman = queen" showed that the simple 3CosAdd rule misses relational information that the embeddings clearly do encode but that can be recovered only by more sophisticated readout methods, and they proposed an alternative LRCos method that uses logistic regression over cosine features to substantially outperform vector arithmetic[24]. Allen and Hospedales (2019) provided a partial theoretical account of why analogies work at all, deriving conditions under which the parallelogram approximation is exact for jointly co-occurring relations[25].
The qualitative geometric phenomenon, that consistent relations correspond to roughly parallel vector offsets, is robust and was the main observation that made word2vec famous. But the standard 3CosAdd evaluation is now understood to overstate the regularity of the embedding space, and modern evaluations supplement it with relation prediction, lexical entailment, and probing tasks.
Alongside the code, Google released a set of pre-trained vectors that became the de facto default embedding for an entire era of NLP. The file GoogleNews-vectors-negative300.bin.gz is roughly 1.6 GB compressed (about 3.4 GB uncompressed) and contains 300-dimensional vectors for 3 million words and phrases, trained on a Google News corpus of approximately 100 billion tokens using Skip-gram with negative sampling[4][26]. The "phrases" included in the vocabulary are the multi-word expressions discovered by the phrase-learning procedure of the October 2013 paper.
The file format is a short plain-text header (<vocab_size> <dim>\n) followed, for each word, by the word itself, a space, and dim * 4 bytes of single-precision floats. This compact format and its bundled C reader made the model trivial to load from C, C++, Python (via gensim), Java, and just about every other language with a binding. It is widely available on package mirrors and is the embedding most often referenced when older NLP papers say simply "we initialize with word2vec"[26].
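The layout just described is simple enough to read and write with the standard library. The sketch below round-trips three toy vectors through an in-memory buffer; it assumes little-endian floats (the native order on the x86 machines the C tool typically ran on) and writes a newline after each vector, as the reference tool does, which the reader then skips. The function names and sample vectors are illustrative.

```python
import io
import struct


def write_vectors(stream, vectors):
    """Write {word: [floats]} as '<vocab_size> <dim>\\n' plus one record per word."""
    dim = len(next(iter(vectors.values())))
    stream.write(f"{len(vectors)} {dim}\n".encode("utf-8"))
    for word, vec in vectors.items():
        stream.write(word.encode("utf-8") + b" ")
        stream.write(struct.pack(f"<{dim}f", *vec))  # assumed little-endian float32
        stream.write(b"\n")


def read_vectors(stream):
    """Read the format produced by write_vectors back into a dict."""
    header = stream.readline().decode("utf-8").split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for _ in range(vocab_size):
        chars = bytearray()
        while True:
            ch = stream.read(1)
            if ch == b"":
                raise EOFError("unexpected end of file while reading a word")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline left after the previous vector
                chars.extend(ch)
        values = struct.unpack(f"<{dim}f", stream.read(4 * dim))
        vectors[chars.decode("utf-8")] = list(values)
    return vectors


# Values chosen to be exactly representable in single precision.
sample = {"king": [0.5, -1.25, 2.0], "queen": [0.25, 1.5, -3.0], "New_York": [1.0, 0.0, -0.5]}
buffer = io.BytesIO()
write_vectors(buffer, sample)
buffer.seek(0)
loaded = read_vectors(buffer)
print(loaded["queen"])
```

For the full GoogleNews file one would normally use an existing loader rather than a byte-at-a-time reader like this one, which is written for clarity rather than speed.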
Other notable pre-trained word2vec releases include domain-specific vectors trained on PubMed/MEDLINE biomedical text (e.g., the BioASQ vectors and Pyysalo's PubMed-PMC release), the freebase entity vectors that Google bundled with the original release, and the Wikipedia + Gigaword Skip-gram release distributed through the gensim project. Many of these community releases were trained by users with the same C code and uploaded for redistribution.
For two years after publication, word2vec's empirical success was widely seen as something of a black box. A few hyperparameter choices (the 3/4 power, the 10^-5 subsampling threshold, the choice of negative sampling over hierarchical softmax) had been arrived at empirically, and the theoretical relationship between the new "predictive" embeddings and the older "count-based" methods such as LSA was unclear.
In 2014, Omer Levy and Yoav Goldberg published "Neural Word Embedding as Implicit Matrix Factorization" at NIPS, which provided a clean answer for Skip-gram with negative sampling: in the limit of unbounded vector dimension, SGNS is implicitly factorizing the word-context matrix whose (w, c) entry is
PMI(w, c) - log k,
where PMI(w, c) = log [ #(w, c) · |D| / (#(w) · #(c)) ] is pointwise mutual information and k is the number of negative samples[21]. In other words, word2vec's predictive training is mathematically equivalent, at optimum and for sufficient dimension, to factorizing a shifted-PMI matrix, the same family of matrices that distributional or count-based methods had been factorizing for decades. Levy and Goldberg further showed that an explicit SVD on the same matrix produces embeddings of comparable quality to SGNS on standard benchmarks once a few preprocessing details (positive PMI, context smoothing) are matched[21]. A follow-on study, "Improving Distributional Similarity with Lessons Learned from Word Embeddings" (Levy, Goldberg, and Dagan, TACL 2015), demonstrated that most of word2vec's gains over earlier methods came from hyperparameter choices and the engineering tricks of subsampling, negative-sampling smoothing, and the 3/4 exponent rather than from the use of neural networks as such[17].
This result, together with the contemporaneous GloVe paper which derived embeddings explicitly from log co-occurrence counts[27], unified the predictive and count-based families and clarified why word2vec worked so well. A subsequent line of work by Arora, Liang, Ma, Risteski and others (the random-walk PMI model) gave a generative justification for both the embedding geometry and the parallelogram analogy property[28].
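The shifted-PMI entry PMI(w, c) - log k can be computed directly from co-occurrence counts. The sketch below does so for a tiny hand-made list of (word, context) pairs; the pairs and the function name `shifted_pmi` are illustrative assumptions, and unobserved pairs (whose PMI is negative infinity) are simply left out of the result.

```python
import math
from collections import Counter


def shifted_pmi(pairs, k):
    """Return {(w, c): PMI(w, c) - log k} for every observed (word, context) pair."""
    pair_counts = Counter(pairs)
    word_counts = Counter(w for w, _ in pairs)
    context_counts = Counter(c for _, c in pairs)
    total = len(pairs)  # |D|, the number of observed pairs
    return {
        (w, c): math.log(n * total / (word_counts[w] * context_counts[c])) - math.log(k)
        for (w, c), n in pair_counts.items()
    }


# Toy co-occurrence data: "the" is shared, "purrs" and "barks" are distinctive.
pairs = (
    [("cat", "purrs")] * 2
    + [("cat", "the")] * 2
    + [("dog", "barks")] * 2
    + [("dog", "the")] * 2
)
matrix = shifted_pmi(pairs, k=2)
for key in sorted(matrix):
    print(key, round(matrix[key], 4))
```

Here PMI("cat", "purrs") = log 2 and PMI("cat", "the") = 0, so with k = 2 the shifted entries are 0 and -log 2 respectively: the distinctive context scores higher than the uninformative one, and a larger k pushes every entry down by the same amount.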
Google's reference implementation was released under the Apache License 2.0 at https://code.google.com/p/word2vec and is preserved on the Google Code Archive[4]. The code is roughly 700 lines of single-file C, multi-threaded with POSIX threads, and uses a precomputed sigmoid lookup table plus aggressive SSE-friendly memory layout to train Skip-gram on the 100B-token Google News corpus in roughly one day on a single multi-core CPU[4][7]. The same repository contains shell scripts for the analogy benchmark (questions-words.txt), the phrase-learning utility (word2phrase), and links to the GoogleNews binary download. Many later C/C++ implementations are direct forks or close ports.
In Python, the most widely used implementation is part of gensim, the topic-modeling and embeddings library created by Radim Řehůřek. Gensim's gensim.models.Word2Vec exposes CBOW and Skip-gram with both hierarchical softmax and negative sampling, supports streaming corpora that do not fit in RAM, and uses Cython-compiled inner loops to achieve roughly the same throughput as the original C code (the documentation cites a ~70x speedup over a pure-NumPy implementation)[29]. Gensim also provides loaders for Google's binary format (KeyedVectors.load_word2vec_format), tools for analogy and similarity benchmarks, and follow-on implementations of doc2vec and FastText. As of 2026 gensim remains the most widely used non-C word2vec implementation, and the KeyedVectors interface is the lingua franca for sharing static embeddings of any provenance.
TensorFlow, PyTorch, and most other deep-learning frameworks ship word2vec as one of their canonical "first models" tutorials. The TensorFlow team in particular maintained tutorials/text/word2vec and the older tensorflow/models Skip-gram-with-NCE example; these are pedagogically useful but slower than gensim or the C code at training scale, because they retain the framework overhead unnecessary for such a tiny model.
Notable third-party implementations include DL4J's Word2Vec for the JVM, Spark MLlib's distributed Word2Vec for cluster training, Hugging Face's word vector support inside the tokenizers and transformers ecosystems, and a long tail of Rust, Go, and JavaScript ports aimed at specific deployment environments such as browsers and embedded systems.
word2vec's most direct contemporaries are GloVe (Stanford NLP, EMNLP 2014) and FastText (Facebook AI Research, TACL 2017). The three methods produce static word vectors of broadly comparable quality, but they differ in mechanism and in the cases where one is clearly preferable.
| Property | word2vec (SGNS) | GloVe | FastText |
|---|---|---|---|
| Year | 2013 | 2014 | 2016/2017 |
| First author | Mikolov | Pennington | Bojanowski |
| Affiliation | Google | Stanford | Facebook AI Research |
| Objective | Predict context (or target) | Weighted least squares on log co-occurrence | Skip-gram over character n-grams |
| Input unit | Word | Word | Subword n-grams + word |
| Handles OOV | No | No | Yes (composes from n-grams) |
| Default pretrained vectors | GoogleNews 300d, 3M vocab | Common Crawl 300d, 1.9M vocab | Wikipedia+CC 300d, 157 languages |
| Trains in passes over corpus | Yes | No (precomputed matrix) | Yes |
| Memory dominated by | Vocabulary | Co-occurrence matrix | Subword + word tables |
GloVe (Pennington, Socher, Manning, 2014) builds a word-word co-occurrence matrix in a single pass over the corpus, then fits a log-bilinear model to its log-counts by iteratively minimizing a weighted least-squares loss. The original GloVe paper reported 75.0% on the Google analogy task with 1.6B tokens and 300-dimensional vectors, an 11-percentage-point improvement over the Skip-gram numbers Mikolov reported at comparable scale[27]. Subsequent controlled comparisons (Levy, Goldberg, and Dagan, 2015) found that once hyperparameters are matched, the methods perform within a few points of each other and the choice depends mostly on which pre-trained vectors happen to be available[17].
FastText (Bojanowski, Grave, Joulin, Mikolov, 2017) extends Skip-gram by representing each word as the sum of the embeddings of its character n-grams (typically n = 3 to 6) plus an embedding of the whole word[11]. For the word "where" with n = 3, FastText sums vectors for the n-grams <wh, whe, her, ere, re> (where < and > mark the word boundaries) and a vector for the whole token. This subword representation gives FastText two structural advantages over word2vec: it can produce embeddings for out-of-vocabulary words from their character n-grams, and it handles morphologically rich languages (Turkish, Finnish, Russian, Czech, German) far better. Facebook later released pre-trained FastText vectors in 157 languages from Common Crawl and Wikipedia[30].
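The n-gram decomposition is easy to reproduce. The sketch below extracts boundary-marked character n-grams and composes a vector for an unseen word by summing whatever n-gram vectors are available; the lookup table `gram_vecs` and its values are hypothetical, and the sketch uses a plain dictionary rather than the hashed n-gram buckets a production implementation would use.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word` padded with < and > boundary markers."""
    padded = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams


def compose(word, gram_vecs, dim, n_min=3, n_max=6):
    """Sum the vectors of all known n-grams of `word` (zeros if none are known)."""
    total = [0.0] * dim
    for gram in char_ngrams(word, n_min, n_max):
        vec = gram_vecs.get(gram)
        if vec is not None:
            total = [a + b for a, b in zip(total, vec)]
    return total


print(char_ngrams("where", 3, 3))

# Hypothetical n-gram table; real tables are learned during training.
gram_vecs = {"<wh": [1.0, 0.0], "her": [0.0, 2.0], "re>": [0.5, 0.5]}
print(compose("where", gram_vecs, dim=2, n_min=3, n_max=3))
```

Because the vector is assembled from n-grams, a word never seen in training still receives a representation as long as some of its n-grams were seen.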
word2vec's release set off a wave of related embedding models, both extensions and competitors.
An early and widely discussed finding about word2vec was that its embeddings encode social biases present in the training corpus. Bolukbasi, Chang, Zou, Saligrama, and Kalai (NIPS 2016) showed that the GoogleNews-vectors-negative300 vectors reproduce gender stereotypes geometrically: solving the analogy man : computer programmer :: woman : ? returns homemaker, and many similar gender-stereotyped associations are recoverable as a single gender direction in the embedding space[36]. They formalized this as the difference vector between gendered pairs (he/she, man/woman, etc.) and proposed a debiasing procedure that projects gender-neutral words orthogonal to that direction while preserving the gender-defining vocabulary.
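The projection step can be sketched with a single gendered pair. This is a simplified illustration, not the published procedure in full: it takes the gender direction to be the difference of one hand-made he/she pair, whereas the approach described above draws on several pairs, and the two-dimensional toy vectors are assumptions chosen so the arithmetic is easy to follow.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def neutralize(v, g):
    """Remove from v its component along the direction g."""
    coef = dot(v, g) / dot(g, g)
    return [a - coef * b for a, b in zip(v, g)]


# Toy vectors (hypothetical): first coordinate plays the role of gender.
he = [1.0, 0.2]
she = [-1.0, 0.2]
programmer = [0.4, 0.9]

gender_direction = [a - b for a, b in zip(he, she)]  # single-pair simplification
debiased = neutralize(programmer, gender_direction)

print(dot(programmer, gender_direction))  # nonzero: the word leans toward "he"
print(debiased, dot(debiased, gender_direction))
```

After the projection the word has no component along the chosen direction, which is the "most direct geometric signal" that this family of methods removes.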
Caliskan, Bryson, and Narayanan (Science, 2017) extended the analysis with the Word Embedding Association Test (WEAT), a direct adaptation of the psychometric Implicit Association Test. They showed that word2vec and GloVe embeddings recover effect sizes and directions for nearly every documented IAT bias, from gender and career associations to racial associations and pleasantness ratings of insects versus flowers[37]. Their result was particularly significant because it demonstrated that biases were not artifacts of any one corpus but reflections of statistical regularities in any large body of human text.
Subsequent work tempered some of the optimism around debiasing. Gonen and Goldberg (NAACL 2019), in a paper pointedly titled "Lipstick on a Pig," demonstrated that the Bolukbasi-style projection methods reduce the most direct geometric signal of bias but leave a substantial residual: gender-neutralized words remain clusterable along the gender axis, and a downstream classifier can recover gender from the supposedly debiased vectors with very high accuracy[38]. This work motivated a research agenda focused on causal and counterfactual approaches to fairness in embeddings, rather than purely geometric postprocessing.
word2vec produces a single static vector per word, so it cannot represent polysemy: the word "bank" gets one embedding whether it appears in a sentence about rivers or about finance. This is the fundamental limitation of the entire 2013-2017 generation of static word embeddings (word2vec, GloVe, FastText) and the limitation that motivated contextual embeddings.
ELMo (Peters et al., NAACL 2018) used a bidirectional LSTM language model to produce embeddings that depend on the entire surrounding sentence, so "bank" in river bank and bank account receive different vectors[5]. A few months later, BERT (Devlin et al., 2018) replaced the LSTM with a deep Transformer trained with masked language modeling and next-sentence prediction, decisively raising the bar across nearly every NLP benchmark[6]. By 2020 the standard NLP pipeline had shifted from "static word2vec or GloVe embeddings into a task-specific model" to "fine-tune a pre-trained Transformer encoder," and the term "word embedding" itself increasingly meant a hidden representation inside such a model rather than a row of a lookup table.
Despite being eclipsed at the frontier, word2vec is far from extinct as of 2026: