Word2Vec is a family of word embedding models that learn dense vector representations of words from large text corpora. It was developed by Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean at Google and introduced in January 2013 through the paper "Efficient Estimation of Word Representations in Vector Space" [1]. A second paper, "Distributed Representations of Words and Phrases and their Compositionality," followed in October 2013, adding Ilya Sutskever as a co-author and introducing key training optimizations including negative sampling and phrase-level representations [2]. Both papers quickly became among the most cited works in natural language processing (NLP), and Word2Vec received the NeurIPS Test of Time Award in 2023 [3].
Word2Vec demonstrated that simple neural network architectures trained on raw text could produce word vectors that captured rich semantic and syntactic relationships. Its most celebrated property is vector arithmetic on word meanings: the vector for "king" minus "man" plus "woman" yields a vector closest to "queen." This result illustrated that the geometric structure of the learned vector space encoded meaningful linguistic relationships, an insight that reshaped how researchers approached language representation and laid the groundwork for nearly every modern embedding technique.
Before Word2Vec, words in NLP systems were typically represented as sparse, high-dimensional vectors. The dominant approaches were one-hot encoding (where each word is a binary vector with a single 1 in the position corresponding to that word's index in the vocabulary) and count-based methods like TF-IDF and Latent Semantic Analysis (LSA). One-hot encoding treats every pair of words as equally dissimilar, providing no information about semantic relationships. TF-IDF captures term importance but not meaning. LSA applies singular value decomposition to a term-document matrix to extract latent dimensions, but it operates on global co-occurrence statistics and is computationally expensive at large scale.
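The blind spot of one-hot encoding can be seen in a short sketch (the toy vocabulary and indices are illustrative only):

```python
import numpy as np

# Toy vocabulary; the index assignment is arbitrary (illustrative only).
vocab = {"cat": 0, "dog": 1, "car": 2}

def one_hot(word, vocab):
    """Sparse one-hot vector: 1 at the word's index, 0 elsewhere."""
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

# Every distinct pair of one-hot vectors is orthogonal, so the representation
# carries no information about how similar the words actually are.
cat, dog, car = (one_hot(w, vocab) for w in ("cat", "dog", "car"))
print(np.dot(cat, dog))  # 0.0 -- "cat" vs "dog"
print(np.dot(cat, car))  # 0.0 -- "cat" vs "car": equally dissimilar
```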
Distributed word representations, where each word maps to a dense vector of real numbers in a lower-dimensional space, had been explored before Word2Vec. Bengio et al. proposed a neural probabilistic language model in 2003 that learned word embeddings jointly with a language model [4]. Collobert and Weston demonstrated in 2008 that pre-trained word embeddings could improve performance on multiple NLP tasks [5]. However, these earlier approaches were limited by computational cost. Training neural language models on large corpora was slow, restricting both the amount of data that could be used and the dimensionality of the resulting vectors.
Mikolov and colleagues at Google set out to design architectures that could learn high-quality word vectors from very large datasets (billions of words) in a matter of hours rather than days or weeks. Their key insight was to simplify the model architecture by removing the hidden nonlinear layer used in traditional neural language models. This simplification reduced computational complexity while preserving (and in many cases improving) the quality of the resulting word vectors.
Word2Vec encompasses two distinct model architectures: Continuous Bag-of-Words (CBOW) and Skip-gram. Both are shallow neural networks with a single hidden layer (the projection layer), but they differ in their prediction objective.
CBOW predicts a target word from its surrounding context words. Given a window of context words (for example, two words before and two words after the target position), CBOW averages or sums their input vectors and uses the result to predict the target word. The name "bag of words" reflects the fact that the order of context words does not affect the prediction; only their identity matters.
The architecture works as follows:
1. Each context word is looked up in the input embedding matrix to retrieve its vector.
2. The context vectors are averaged (or summed) to form the hidden projection layer.
3. The projection is multiplied by the output weight matrix, producing one score per vocabulary word.
4. A softmax converts the scores into a probability distribution over candidate target words.
CBOW can be understood as a "fill in the blank" task. Given a sentence like "The cat sat on the ___," the model uses the surrounding words to predict "mat." Words that appear in similar contexts will develop similar vector representations, since they need to produce similar predictions when used as context.
CBOW is faster to train than Skip-gram because it aggregates context information into a single prediction. It also tends to produce slightly better representations for frequent words, since it effectively smooths over multiple context observations in a single training step.
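A minimal numpy sketch of the CBOW forward pass, with an assumed toy vocabulary size and embedding dimension (training updates and the softmax optimizations discussed later are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 10, 4          # toy vocabulary size and embedding dimension (assumed)
W_in = rng.normal(scale=0.1, size=(V, D))   # input (context) embeddings
W_out = rng.normal(scale=0.1, size=(D, V))  # output (prediction) weights

def cbow_forward(context_ids):
    """Average the context word vectors, then score every vocabulary word."""
    h = W_in[context_ids].mean(axis=0)      # projection layer: mean of context vectors
    scores = h @ W_out                      # one score per vocabulary word
    probs = np.exp(scores - scores.max())   # numerically stable softmax
    return probs / probs.sum()

# e.g. two word IDs before and two after the target position
p = cbow_forward([2, 5, 7, 1])
assert p.shape == (V,) and abs(p.sum() - 1.0) < 1e-9
```

Note that shuffling the context IDs leaves the prediction unchanged, which is exactly the "bag of words" property described above.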
Skip-gram reverses the CBOW objective: given a target word, it predicts each of the surrounding context words independently. For each word in the training corpus, the model uses that word's vector to predict words within a specified window around it.
The architecture works as follows:
1. The target word is looked up in the input embedding matrix; its vector forms the projection layer directly.
2. The projection is multiplied by the output weight matrix, producing one score per vocabulary word.
3. A softmax converts the scores into a probability distribution, and this prediction is made separately for each context position within the window.
Skip-gram treats each (target, context) pair independently, which means a single target word produces multiple training examples per occurrence. This makes Skip-gram more effective for rare words, since even a word that appears infrequently in the corpus generates several training updates from its various context positions.
In practice, Skip-gram tends to produce higher-quality embeddings overall, particularly for rare words and on semantic relationship tasks [1]. However, it is slower to train because it generates more training pairs per word occurrence.
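How Skip-gram multiplies training examples per occurrence can be illustrated by generating the (target, context) pairs for a short sentence (a sketch; real implementations also randomly shrink the window per position):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training pairs within a symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is never its own context
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "cat", "sat", "on", "the", "mat"], window=2)
# A single occurrence of "cat" yields pairs with "the", "sat", and "on",
# so even a rare word receives several training updates per occurrence.
```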
| Property | CBOW | Skip-gram |
|---|---|---|
| Prediction direction | Context predicts target | Target predicts context |
| Input | Multiple context words | Single target word |
| Output | Single target word | Multiple context words |
| Training speed | Faster | Slower |
| Rare word handling | Weaker | Stronger |
| Semantic accuracy | Good | Better |
| Syntactic accuracy | Good | Good |
| Best suited for | Large corpora, frequent words | Smaller corpora, rare words |
Training Word2Vec involves presenting the model with (input, output) pairs extracted from a text corpus and adjusting the network weights to improve predictions. However, the naive approach of computing a full softmax over the entire vocabulary for every training example is prohibitively expensive for large vocabularies (often hundreds of thousands or millions of words). The denominator of the softmax requires summing exponentials over all vocabulary words, making each update scale linearly with vocabulary size.
The second Word2Vec paper [2] introduced two critical training optimizations that made large-scale training feasible: hierarchical softmax and negative sampling.
Hierarchical softmax replaces the flat softmax layer with a binary tree structure, specifically a Huffman tree built from word frequencies. Each word in the vocabulary corresponds to a leaf node, and the probability of a word is computed as the product of probabilities along the path from the root to that leaf. Each internal node of the tree has a learned parameter vector, and at each node, the model makes a binary classification decision (left or right).
Because the Huffman tree assigns shorter paths to more frequent words and longer paths to rarer words, the average number of computations per training example is proportional to the logarithm of the vocabulary size rather than the vocabulary size itself. For a vocabulary of 100,000 words, this reduces the number of operations per update from 100,000 to roughly 17 (since log2(100,000) is approximately 17).
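The frequency-dependent path lengths can be demonstrated by building a Huffman tree over toy word counts (the counts are hypothetical; only the tree construction is standard):

```python
import heapq

def huffman_code_lengths(freqs):
    """Build a Huffman tree over word frequencies; return each word's path length."""
    # Heap entries: (frequency, tiebreak counter, {word: depth so far}).
    heap = [(f, i, {w: 0}) for i, (w, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)   # merge the two least frequent nodes
        f2, _, b = heapq.heappop(heap)
        merged = {w: d + 1 for w, d in {**a, **b}.items()}
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

lengths = huffman_code_lengths({"the": 1000, "cat": 50, "sat": 40, "mat": 5})
# Frequent words get shorter paths, i.e. fewer binary decisions per update.
assert lengths["the"] < lengths["mat"]
```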
Negative sampling (NEG), a simplified variant of noise-contrastive estimation (NCE), takes a different approach. Instead of computing probabilities over the full vocabulary, it reformulates the problem as a set of binary classification tasks. For each positive (target, context) pair observed in the training data, the model draws k "negative" examples by randomly sampling words from the vocabulary according to a noise distribution. The model is then trained to distinguish the real context word from the noise words using logistic regression.
The noise distribution used in practice raises the unigram frequency of each word to the power of 0.75 before normalizing. This exponent compresses the frequency distribution, giving rare words a higher probability of being selected as negative samples than they would receive under the raw unigram distribution. This improves the quality of embeddings for infrequent words.
The number of negative samples k is a hyperparameter. Mikolov et al. found that k = 5 to 20 works well for smaller training datasets, while k = 2 to 5 is sufficient for very large datasets [2]. Negative sampling is simpler to implement than hierarchical softmax and generally produces embeddings of equal or better quality on downstream tasks.
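A sketch of both pieces, with hypothetical unigram counts: the 0.75-smoothed noise distribution, and the negative-sampling objective as a logistic loss over one true pair and k noise pairs (parameter updates are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unigram counts; the noise distribution raises them to 0.75.
counts = np.array([900.0, 80.0, 15.0, 5.0])
noise = counts ** 0.75
noise /= noise.sum()

# Smoothing gives the rarest word a larger share than the raw distribution.
raw = counts / counts.sum()
assert noise[-1] > raw[-1]

def neg_sampling_loss(v_target, v_context, v_negatives):
    """Logistic loss: pull the true pair together, push k noise pairs apart."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    pos = np.log(sigmoid(v_target @ v_context))            # true pair -> label 1
    neg = np.sum(np.log(sigmoid(-(v_negatives @ v_target))))  # noise -> label 0
    return -(pos + neg)

D, k = 8, 5  # embedding dimension and number of negatives (assumed)
loss = neg_sampling_loss(rng.normal(size=D), rng.normal(size=D),
                         rng.normal(size=(k, D)))
assert loss > 0
```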
Very common words like "the," "a," and "is" appear in an enormous number of context windows but carry relatively little semantic information. The second paper introduced a subsampling technique that randomly discards frequent words during training with a probability that increases with word frequency. This serves two purposes: it speeds up training by reducing the number of training examples, and it improves embedding quality by allowing the model to focus on more informative word co-occurrences.
The subsampling probability for a word with frequency f(w) is:
P(discard) = 1 - sqrt(t / f(w))
where t is a chosen threshold (typically around 10^-5). Words with frequencies much higher than the threshold are aggressively subsampled.
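The formula is easy to evaluate directly; the example frequencies below are illustrative (a stop word occupying roughly 5% of all tokens versus a genuinely rare word):

```python
import math

def p_discard(word_freq, t=1e-5):
    """Probability of discarding an occurrence during frequent-word subsampling.

    word_freq is the word's relative corpus frequency; t is the threshold
    (typically around 1e-5). Clamped at 0 for words rarer than the threshold.
    """
    return max(0.0, 1.0 - math.sqrt(t / word_freq))

print(round(p_discard(0.05), 3))  # ~0.986: a very frequent word is almost always dropped
print(p_discard(1e-6))            # 0.0: a rare word is never dropped
```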
The second paper also introduced a method for learning embeddings for multi-word phrases like "New York" or "San Francisco." A simple scoring function based on co-occurrence statistics identifies word pairs that frequently appear together and rarely appear independently. These pairs are then treated as single tokens during training. Running the phrase detection multiple times allows the model to identify longer phrases (e.g., "New York Times").
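The scoring function from the second paper can be sketched as follows; the counts are hypothetical, and the merge threshold is a tunable hyperparameter:

```python
def phrase_score(bigram_count, count_a, count_b, delta=5):
    """Phrase score: high when two words co-occur far more often than chance.

    delta is a discounting constant that prevents very rare bigrams from
    scoring highly by accident.
    """
    return (bigram_count - delta) / (count_a * count_b)

# Hypothetical counts: "new york" co-occurs heavily relative to its parts,
# while an arbitrary adjacent pair does not.
score_ny = phrase_score(bigram_count=500, count_a=1000, count_b=600)
score_other = phrase_score(bigram_count=10, count_a=1000, count_b=600)
assert score_ny > score_other
# Pairs scoring above a chosen threshold are merged into tokens like "new_york";
# rerunning the pass can then merge "new_york" with "times".
```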
The most striking property of Word2Vec embeddings is that vector arithmetic captures linguistic relationships. The original papers demonstrated this through analogy tasks of the form "A is to B as C is to ___," solved by computing the vector operation B - A + C and finding the nearest word vector to the result.
Semantic analogy examples:
- king - man + woman ≈ queen
- Paris - France + Italy ≈ Rome (country-to-capital)
- brother - man + woman ≈ sister
Syntactic analogy examples:
- walking - walk + swim ≈ swimming (present participle)
- bigger - big + cold ≈ colder (comparative)
- cars - car + apple ≈ apples (plural)
These relationships emerge without any explicit supervision. The model learns them purely from the statistical patterns of word co-occurrence in the training text.
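The analogy computation itself is a few lines of vector arithmetic. The embeddings below are hand-made toy vectors chosen so the example works; real Word2Vec vectors are learned from text:

```python
import numpy as np

# Hypothetical toy embeddings (assumed for illustration only).
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "man":   np.array([0.1, 0.2, 0.1]),
    "woman": np.array([0.1, 0.2, 0.9]),
    "apple": np.array([0.5, 0.1, 0.2]),
}

def nearest(vec, emb, exclude):
    """Closest word by cosine similarity, excluding the query words themselves."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude),
               key=lambda w: cos(vec, emb[w]))

# "man is to king as woman is to ___": compute king - man + woman
query = emb["king"] - emb["man"] + emb["woman"]
print(nearest(query, emb, exclude={"king", "man", "woman"}))  # queen
```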
Word2Vec embeddings organize words into a vector space where multiple types of relationships correspond to consistent directions. The direction from "man" to "woman" is roughly the same as the direction from "king" to "queen" or from "uncle" to "aunt." Similarly, the direction encoding country-to-capital relationships is consistent across different country-capital pairs.
This structure arises because words that serve similar functional roles in language appear in similar contexts. A neural network trained to predict context from words (or vice versa) naturally maps functionally similar words to nearby points in the embedding space.
Cosine similarity between Word2Vec vectors correlates well with human judgments of word similarity. Words that are semantically related (like "dog" and "puppy" or "car" and "automobile") have high cosine similarity, while unrelated words (like "dog" and "parliament") have low similarity.
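Cosine similarity is the standard comparison metric for these vectors; a minimal sketch with hypothetical vectors (related words point in similar directions):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product of the normalized vectors, in [-1, 1]."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical vectors, chosen so related words have similar directions.
dog, puppy, parliament = (np.array(v) for v in
                          ([1.0, 2.0, 0.1], [0.9, 1.8, 0.3], [-1.5, 0.2, 2.0]))
assert cosine(dog, puppy) > cosine(dog, parliament)
```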
When visualized using dimensionality reduction techniques like t-SNE, Word2Vec embeddings show clear clustering of semantically related words. Countries cluster together, animals form a group, colors group together, and so on.
The quality of Word2Vec embeddings depends on several hyperparameters:
| Hyperparameter | Typical Values | Effect |
|---|---|---|
| Vector dimension | 100-300 | Higher dimensions capture more information but require more data |
| Window size | 5-10 | Larger windows capture broader topical similarity; smaller windows capture syntactic similarity |
| Minimum word count | 5-10 | Filters out very rare words that lack sufficient training signal |
| Negative samples (k) | 5-20 | More negatives improve quality but slow training |
| Subsampling threshold | 10^-5 | Controls how aggressively frequent words are downsampled |
| Learning rate | 0.025 (Skip-gram), 0.05 (CBOW) | Starting learning rate, linearly decayed during training |
| Training epochs | 5-15 | More epochs improve quality on smaller corpora |
Authored by Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, this paper was submitted to arXiv on January 16, 2013 (arXiv:1301.3781) and presented at the 1st International Conference on Learning Representations (ICLR) workshop in Scottsdale, Arizona, in May 2013 [1]. It introduced the CBOW and Skip-gram architectures and demonstrated that they could learn high-quality word vectors far more efficiently than previous neural language models. The paper showed that simple model architectures, when trained on sufficient data, could match or exceed the quality of embeddings from more complex models.
Authored by Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean, this paper was submitted to arXiv on October 16, 2013 (arXiv:1310.4546) and published in Advances in Neural Information Processing Systems 26 (NIPS 2013) [2]. It introduced negative sampling, subsampling of frequent words, and phrase detection. These extensions dramatically improved both training speed and embedding quality, making it practical to train Word2Vec on corpora of billions of words.
Xin Rong published a widely read tutorial paper in 2014 that provided detailed derivations of the backpropagation equations for both CBOW and Skip-gram, helping the research community understand the mechanics of Word2Vec training [6].
Word2Vec's impact on NLP was immediate and far-reaching. Before its publication, most NLP systems relied on handcrafted features and sparse representations. Word2Vec demonstrated that dense, learned representations could capture subtle semantic relationships and serve as effective features for a wide range of tasks.
Word2Vec embeddings were quickly adopted across NLP:
- As input features for text classification, sentiment analysis, and named entity recognition
- As initialization for the embedding layers of neural networks for machine translation and other sequence tasks
- For semantic matching and query expansion in information retrieval
- For measuring word and document similarity in clustering and recommendation systems
Beyond direct applications, Word2Vec shifted the field's thinking in several important ways:
Transfer learning for NLP: Word2Vec popularized the idea of pre-training representations on large unlabeled corpora and then using those representations for downstream tasks with limited labeled data. This pre-train/fine-tune paradigm eventually led to models like ELMo, BERT, and GPT.
Distributional semantics at scale: While the distributional hypothesis ("a word is characterized by the company it keeps") had existed since the 1950s (Firth, 1957; Harris, 1954), Word2Vec provided a practical, scalable method for operationalizing it.
Neural NLP: Word2Vec helped catalyze the broader shift from feature-engineered NLP to neural NLP. Once high-quality word vectors were freely available, researchers began building neural architectures (recurrent neural networks, convolutional networks, and eventually Transformers) that operated directly on dense representations.
GloVe (Global Vectors for Word Representation) was developed by Jeffrey Pennington, Richard Socher, and Christopher Manning at Stanford University and published in 2014 [7]. While Word2Vec learns embeddings through local context window prediction, GloVe takes a different approach by constructing a global word-word co-occurrence matrix from the entire corpus and then factorizing it.
GloVe's key insight is that the ratios of co-occurrence probabilities between words encode meaningful relationships. For example, the ratio of the probability that "ice" co-occurs with "solid" to the probability that "steam" co-occurs with "solid" is much greater than 1, reflecting that "solid" is more associated with "ice" than "steam." GloVe directly optimizes word vectors to preserve these ratios.
| Property | Word2Vec | GloVe |
|---|---|---|
| Training method | Predict context from words (or reverse) | Matrix factorization of co-occurrence counts |
| Information used | Local context windows | Global co-occurrence statistics |
| Training objective | Classification (softmax or negative sampling) | Weighted least squares on log co-occurrences |
| Training approach | Online, stochastic | Batch (requires pre-computed co-occurrence matrix) |
| Performance | Strong on analogy tasks | Comparable; sometimes better on similarity tasks |
| Pre-trained vectors available | Google News (3M words, 300d) | Wikipedia + Gigaword (400K words, various dimensions) |
Pennington et al. argued that GloVe combined the advantages of global matrix factorization methods (like LSA) with the advantages of local context window methods (like Word2Vec). In practice, the two approaches produce embeddings of comparable quality on most benchmarks, and which performs better depends on the specific task and training corpus.
Levy and Goldberg showed in 2014 that Word2Vec's Skip-gram model with negative sampling is implicitly factorizing a word-context matrix whose entries are the pointwise mutual information (PMI) shifted by a constant, establishing a theoretical connection between the two approaches [8].
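The matrix in question can be computed directly from co-occurrence counts; a sketch of the shifted-PMI construction on a hypothetical 3x3 count matrix (k is the number of negative samples from SGNS):

```python
import numpy as np

def shifted_pmi(cooc, k=5):
    """Shifted PMI matrix: log(P(w,c) / (P(w) P(c))) - log k (Levy & Goldberg)."""
    total = cooc.sum()
    pw = cooc.sum(axis=1, keepdims=True) / total  # marginal word probabilities
    pc = cooc.sum(axis=0, keepdims=True) / total  # marginal context probabilities
    with np.errstate(divide="ignore"):            # zero counts give -inf PMI
        pmi = np.log((cooc / total) / (pw * pc))
    return pmi - np.log(k)

# Hypothetical word-context co-occurrence counts.
cooc = np.array([[10.0, 2.0, 1.0],
                 [2.0, 8.0, 1.0],
                 [1.0, 1.0, 6.0]])
m = shifted_pmi(cooc)
assert m.shape == (3, 3)
```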
FastText was developed by Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov (notably, the same Mikolov who created Word2Vec) at Facebook AI Research and published in 2016 as "Enriching Word Vectors with Subword Information" [9]. FastText extends Word2Vec's Skip-gram model by representing each word as a bag of character n-grams rather than as an atomic unit.
For example, the word "where" with n = 3 would be represented by the character n-grams: <wh, whe, her, ere, re>, plus the special token <where> representing the whole word (the angle brackets denote word boundaries). The vector for "where" is the sum of the vectors for all its constituent n-grams.
This subword approach provides several advantages:
- Out-of-vocabulary words can be assigned vectors by composing their n-grams
- Morphologically related words (e.g., "run," "running," "runner") share n-grams and therefore share representational strength
- Rare words benefit from parameters shared with more frequent words containing the same n-grams
- Misspellings and spelling variants land near the canonical form in the vector space
| Property | Word2Vec | FastText |
|---|---|---|
| Unit of representation | Whole words | Character n-grams |
| OOV word handling | Not possible | Yes, via n-gram composition |
| Morphological awareness | None | Strong |
| Training speed | Fast | Slightly slower (more parameters) |
| Model size | Smaller | Larger (n-gram embeddings) |
| Best suited for | English and other morphologically simple languages | Morphologically rich languages |
Facebook released pre-trained FastText vectors for 157 languages in 2018, making it one of the most widely used multilingual embedding resources [10].
Despite its enormous influence, Word2Vec has several important limitations:
Word2Vec assigns a single vector to each word regardless of context. The word "bank" receives the same representation whether it refers to a financial institution or a river bank. This inability to handle polysemy (words with multiple meanings) limits performance on tasks where context matters.
Word2Vec operates at the word level. It does not directly produce representations for phrases, sentences, or documents. While averaging word vectors can serve as a rough approximation for sentence meaning, this approach loses word order information and struggles with negation, conditionals, and other compositional phenomena.
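The loss of word order under vector averaging is easy to demonstrate; the word vectors below are randomly generated stand-ins for learned embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical word vectors (assumed for illustration).
vecs = {w: rng.normal(size=4) for w in ["man", "bites", "dog"]}

def sentence_vec(words):
    """Naive sentence embedding: unweighted mean of the word vectors."""
    return np.mean([vecs[w] for w in words], axis=0)

a = sentence_vec(["man", "bites", "dog"])
b = sentence_vec(["dog", "bites", "man"])
# The two sentences mean different things but receive identical vectors.
assert np.allclose(a, b)
```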
The quality and biases of Word2Vec embeddings directly reflect the training corpus. Embeddings trained on news text will differ from those trained on social media or scientific literature. Research has shown that Word2Vec embeddings can encode and amplify societal biases present in training data. For example, Bolukbasi et al. (2016) demonstrated that Word2Vec embeddings trained on Google News text associated "man" with "computer programmer" and "woman" with "homemaker" [11].
Standard Word2Vec cannot handle words not present in the training vocabulary. This is particularly problematic for languages with productive morphology, technical domains with specialized terminology, and social media text with frequent misspellings and neologisms.
Word2Vec occupies a pivotal position in the history of NLP. It sits at the beginning of a clear progression from static word embeddings to the contextual, pre-trained language models that dominate modern NLP.
The evolution from Word2Vec to current models followed a recognizable path:
Static word embeddings (2013-2016): Word2Vec (2013), GloVe (2014), and FastText (2016) demonstrated the power of dense vector representations but assigned each word a fixed vector.
Contextual embeddings (2018): ELMo (Embeddings from Language Models), developed by Peters et al. at the Allen Institute for AI, used a bidirectional LSTM to produce word representations that vary based on the surrounding sentence [12]. The same word receives different vectors in different contexts, addressing the polysemy problem.
Pre-trained Transformers (2018-present): BERT (Devlin et al., 2018) [13] and GPT (Radford et al., 2018) [14] replaced LSTMs with the Transformer architecture and trained on massive corpora. These models produce deeply contextual representations and can be fine-tuned for specific tasks, achieving performance far beyond what static embeddings provide.
Large language models (2020-present): GPT-3, GPT-4, Claude, Gemini, and other frontier models scale Transformer-based architectures to hundreds of billions of parameters. Their internal representations serve as extremely powerful contextual embeddings, and the models themselves can perform tasks directly through text generation.
Despite being surpassed in performance, Word2Vec's conceptual contributions remain foundational. The idea that useful representations can be learned from raw text through simple prediction tasks, the demonstration that vector arithmetic can capture semantic relationships, and the pre-train/transfer paradigm all trace back to Word2Vec and its contemporaries.
As of 2026, Word2Vec is still used in several contexts:
- As a fast, lightweight baseline when Transformer-scale models are unnecessary or too costly
- In resource-constrained deployments (mobile, edge, and low-latency systems) where a static embedding lookup table suffices
- In recommendation systems, where the same algorithm is applied to sequences of items rather than words (the item2vec approach)
- As a teaching tool for introducing representation learning and distributional semantics
Imagine you have a giant box of word cards. You want to organize them so that words that mean similar things are close together. Word2Vec is like a game where you read lots and lots of sentences and notice which words hang out together. If "cat" and "dog" keep showing up near the same words (like "pet" and "fluffy" and "cute"), you put their cards close together. If "cat" and "airplane" never show up near the same words, you put them far apart. After playing the game with millions of sentences, you end up with a map where every word has a spot, and the spots tell you how words are related to each other.