An embedding layer is a neural network component that functions as a trainable lookup table, mapping discrete integer indices (such as word IDs, user IDs, or category codes) to dense, continuous-valued vectors. Rather than using sparse, high-dimensional one-hot encodings, the embedding layer stores a weight matrix of shape (vocabulary_size, embedding_dimension) and retrieves the appropriate row for each input index. This operation is mathematically equivalent to multiplying a one-hot vector by a weight matrix, but the lookup-based implementation avoids the wasteful zero multiplications that come with large vocabularies.
Embedding layers are foundational in modern deep learning systems. They appear in natural language processing models (converting tokens to vectors), recommendation systems (representing users and items), and tabular data pipelines (encoding high-cardinality categorical features). The concept of learned distributed representations was introduced by Bengio et al. in their 2003 neural probabilistic language model and later popularized by standalone word embedding methods such as Word2Vec and GloVe.
The idea of representing words as dense vectors has roots stretching back to the 1980s, when Hinton (1986) introduced the concept of "distributed representations." However, the modern embedding layer traces its lineage through several key milestones.
Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin published "A Neural Probabilistic Language Model" in the Journal of Machine Learning Research in 2003. This paper introduced the idea of learning word embeddings jointly with a language modeling objective. The model used a matrix C of dimensions |V| x m, where each row mapped a word to a real-valued vector in R^m. These word vectors were concatenated and fed through a feedforward neural network with a tanh hidden layer, and a softmax output layer predicted the next word. The probability function took the form y = softmax(b + Wx + U tanh(d + Hx)), where x was the concatenation of the context word embeddings. The model learned both the embedding matrix C and the network parameters simultaneously through backpropagation. This paper laid the groundwork for all subsequent word embedding research.
Tomas Mikolov and colleagues at Google published "Efficient Estimation of Word Representations in Vector Space" in 2013, introducing two lightweight architectures, the continuous bag-of-words (CBOW) model and the skip-gram model, that could train on billions of words in hours rather than weeks.
Both architectures used training optimizations to scale to large vocabularies. Negative sampling pairs each positive training example with k randomly sampled "negative" words and updates only those embeddings, reducing the cost of each update from O(V) to O(k). Hierarchical softmax represents the output layer as a binary Huffman tree and evaluates only about log2(V) nodes per prediction instead of all V.
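As an illustration, here is a minimal sketch of the skip-gram negative-sampling loss for a single (center, context) pair; the function name and tensor shapes are assumptions for the example, not code from the papers.

```python
import torch
import torch.nn.functional as F

def neg_sampling_loss(center_vec, context_vec, negative_vecs):
    """Skip-gram with negative sampling for one training pair.

    center_vec:    (d,)   embedding of the center word
    context_vec:   (d,)   embedding of the observed context word
    negative_vecs: (k, d) embeddings of k sampled negative words
    """
    pos_score = torch.dot(center_vec, context_vec)   # pushed to be large
    neg_scores = negative_vecs @ center_vec          # pushed to be small
    return -F.logsigmoid(pos_score) - F.logsigmoid(-neg_scores).sum()
```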
A follow-up paper by Mikolov et al. (2013b), "Distributed Representations of Words and Phrases and their Compositionality," extended the approach to learn embeddings for common phrases and further refined the negative sampling objective.
Jeffrey Pennington, Richard Socher, and Christopher Manning at Stanford developed GloVe (Global Vectors for Word Representation), which combined the strengths of global matrix factorization methods and local context window methods. GloVe first builds a global word-word co-occurrence matrix from the entire corpus in a single pass, then trains embeddings by minimizing a weighted least-squares objective that fits the dot product of word vectors to the logarithm of their co-occurrence count. The weighting function down-weights very frequent co-occurrences to prevent common word pairs from dominating the objective. GloVe often trains faster on large corpora because the expensive co-occurrence counting happens only once upfront.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov at Facebook AI Research introduced FastText, which extended the skip-gram model by representing each word as a bag of character n-grams (typically 3 to 6 characters long). The embedding for a word is computed as the sum of its constituent n-gram embeddings. For example, the word "running" is decomposed into character substrings like "<ru", "run", "unn", "nni", "nin", "ing", "ng>" (with boundary markers), and its vector is the sum of these n-gram vectors. This approach handles out-of-vocabulary words (by summing their n-gram vectors), captures morphological patterns (words sharing suffixes or prefixes get similar embeddings), and produces better representations for rare words.
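A small sketch of the character n-gram decomposition described above; the boundary markers and n-gram range follow the description, while the function name is purely illustrative.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word, with boundary markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

# char_ngrams("running", 3, 3) -> ['<ru', 'run', 'unn', 'nni', 'nin', 'ing', 'ng>']
# A word's FastText vector is the sum of the embeddings of its n-grams.
```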
| Method | Year | Developer | Training approach | Handles OOV words | Embedding type |
|---|---|---|---|---|---|
| Bengio NPLM | 2003 | Universite de Montreal | Feedforward network with softmax | No | Static |
| Word2Vec | 2013 | Google | Skip-gram or CBOW with negative sampling | No | Static |
| GloVe | 2014 | Stanford NLP | Global co-occurrence matrix factorization | No | Static |
| FastText | 2017 | Meta AI (FAIR) | Skip-gram with character n-grams | Yes | Static |
| ELMo | 2018 | Allen AI | Bidirectional LSTM language model | Yes (character-based input) | Contextual |
| BERT | 2018 | Google | Masked language model with transformer | No (uses WordPiece subwords) | Contextual |
An embedding layer maintains a two-dimensional weight matrix W with shape (V, d), where V is the vocabulary size (the number of distinct items) and d is the embedding dimension (the number of values in each vector). When receiving an integer input i, the layer returns the i-th row of W. For a batch of inputs, it returns the corresponding stacked rows.
The forward pass is a pure index-based lookup with no matrix multiplication, making it computationally inexpensive. Given a sequence of token IDs [3, 17, 42], the embedding layer simply fetches rows 3, 17, and 42 from the weight matrix and returns them as a tensor of shape (3, d).
Mathematically, the embedding lookup for index i is equivalent to computing x_onehot * W, where x_onehot is a one-hot vector with a 1 at position i and zeros elsewhere. Since multiplying a one-hot vector by a matrix simply selects one row, the result is identical. However, the embedding layer is far more efficient in practice because it avoids constructing the sparse one-hot vector and performing the full matrix multiplication. One-hot encoding a vocabulary of 100,000 tokens would require creating vectors with 100,000 elements (most of which are zeros) and then performing 99,999 multiplications by zero per input. The embedding layer skips all of this by going directly to the relevant row.
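A quick way to check this equivalence in PyTorch, using small illustrative sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V, d = 10, 4
emb = nn.Embedding(V, d)

i = torch.tensor([3])
lookup = emb(i)                               # direct row lookup

onehot = F.one_hot(i, num_classes=V).float()  # (1, V) one-hot input
matmul = onehot @ emb.weight                  # (1, d) via matrix multiply

assert torch.allclose(lookup, matmul)         # identical results
```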
| Aspect | One-hot + linear layer | Embedding layer (lookup) |
|---|---|---|
| Input representation | Sparse vector of size V | Single integer index |
| Forward pass operation | Full matrix multiplication | Row index lookup |
| Memory for input | O(V) per token | O(1) per token |
| Gradient computation | Dense gradient over full matrix | Sparse gradient (only accessed rows) |
| Bias term | Can include bias | Typically no bias |
| Mathematical result | Identical | Identical |
During backpropagation, only the rows of the embedding matrix that were accessed in the current batch receive non-zero gradients. This sparse gradient property provides a significant efficiency advantage. If a vocabulary contains 50,000 entries but only 128 unique tokens appear in a given batch, only 128 rows of the embedding matrix are updated in that training step. The remaining 49,872 rows are unchanged.
Frameworks like PyTorch support explicitly sparse gradient tensors (via the sparse=True parameter) that store only the non-zero entries, further reducing memory consumption during training. The optimizer then applies updates only to the accessed rows; note that in PyTorch only a subset of optimizers (such as SGD, Adagrad, and SparseAdam) accept sparse gradients.
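A small sketch showing that only the accessed rows receive gradient, and how sparse=True exposes this directly:

```python
import torch
import torch.nn as nn

emb = nn.Embedding(50_000, 64, sparse=True)

batch = torch.tensor([7, 7, 123, 4999])        # only 3 unique rows touched
loss = emb(batch).sum()
loss.backward()

grad = emb.weight.grad
print(grad.is_sparse)                          # True
print(grad.coalesce().indices().unique())     # tensor([   7,  123, 4999])
```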
In the original transformer architecture (Vaswani et al., 2017), the embedding vectors are multiplied by the square root of the embedding dimension (sqrt(d_model)) before being added to positional encodings. This scaling factor prevents the token embeddings from being dwarfed by the positional encoding values, which use sine and cosine functions with values in the range [-1, 1]. Without scaling, the magnitude of the embedding vectors would be much smaller relative to the positional encodings, causing the model to rely too heavily on position information and too little on token identity. When weight tying is used, this same scaling factor compensates for the shared matrix being optimized for two different purposes.
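A sketch of this scaling step, assuming a precomputed positional-encoding tensor named pos_enc (a hypothetical variable for this example):

```python
import math
import torch
import torch.nn as nn

d_model, vocab_size = 512, 30_000
tok_emb = nn.Embedding(vocab_size, d_model)

def embed(input_ids, pos_enc):
    # pos_enc: (max_len, d_model) sinusoidal table, assumed precomputed
    x = tok_emb(input_ids) * math.sqrt(d_model)   # scale as in Vaswani et al.
    return x + pos_enc[: input_ids.size(-1)]
```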
The embedding dimension d controls how many floating-point numbers represent each input category. Choosing the right dimension balances expressiveness against computational cost and overfitting risk.
| Rule of thumb | Formula or range | Typical use case |
|---|---|---|
| Square root rule | d = sqrt(V) | General starting point |
| Fourth root rule (Google) | d = V^(1/4) | Text classification, tabular data |
| Powers of two | d = 32, 64, 128, 256, 512 | GPU-optimized architectures |
| Industry convention for NLP | d = 50 to 300 | Word embeddings (Word2Vec, GloVe) |
| Large language models | d = 768 to 12,288 | BERT, GPT families |
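The first two rules of thumb, evaluated for an example vocabulary; the helper function is illustrative, not a framework API:

```python
def suggested_dims(vocab_size):
    """Return embedding sizes suggested by common rules of thumb."""
    sqrt_rule = round(vocab_size ** 0.5)
    fourth_root_rule = round(vocab_size ** 0.25)
    return {"sqrt": sqrt_rule, "fourth_root": fourth_root_rule}

print(suggested_dims(10_000))   # {'sqrt': 100, 'fourth_root': 10}
```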
Smaller dimensions compress information more aggressively and train faster, but may fail to capture nuanced relationships between items. Larger dimensions can represent subtler distinctions but require more data to avoid overfitting and consume more memory and compute.
Datasets with fewer than 100,000 sentences generally benefit from lower dimensions (50 to 100), while large-scale corpora support dimensions of 300 or higher. For large language models with billions of parameters, embedding dimensions of 4,096 or more are standard. GPT-3 uses an embedding dimension of 12,288, while BERT-base uses 768 and BERT-large uses 1,024.
Recent research has explored automatically learning the optimal embedding dimension for different features, particularly in recommendation systems. Instead of assigning the same dimension to all categorical features, these methods allocate larger dimensions to features with more unique values or more complex relationships, and smaller dimensions to simpler features. This mixed-dimension approach can reduce model size while maintaining or improving accuracy.
Embedding layers can produce either static or contextual representations. Understanding this distinction is important for selecting the right approach for a given task.
Methods like Word2Vec, GloVe, and FastText assign each word a single, fixed vector regardless of context. The word "bank" receives the same vector whether it appears in "river bank" or "bank account," even though the meanings differ. Static embeddings are stored in a simple lookup table and are computationally cheap to retrieve. They work well as feature initializations for downstream models and for tasks where context sensitivity is less important.
Models like ELMo, BERT, and GPT generate word representations dynamically based on the surrounding text. In these architectures, the initial embedding layer still performs a standard lookup, but the resulting vectors are then processed through multiple self-attention or recurrent layers that modify each vector based on the entire input sequence. The word "bank" ends up with different final representations in "river bank" vs. "bank account."
ELMo (Peters et al., 2018) was one of the first widely adopted contextual embedding models. It used a bidirectional LSTM language model and produced context-dependent vectors by combining representations from different LSTM layers. BERT (Devlin et al., 2018) went further by using a transformer encoder trained with a masked language modeling objective, allowing it to incorporate context from both directions simultaneously.
Research by Ethayarajh (2019) showed that less than 5% of the variance in a word's contextual representations (from BERT or GPT-2) can be explained by a static embedding. This finding confirmed that contextual models capture substantially more information about word usage than static approaches.
| Property | Static embeddings | Contextual embeddings |
|---|---|---|
| Vector per word | One fixed vector | Different vector per context |
| Polysemy handling | Cannot disambiguate | Naturally disambiguates |
| Computation cost | Single lookup | Full forward pass through network |
| Storage | One matrix (V x d) | Model weights (millions to billions of parameters) |
| Examples | Word2Vec, GloVe, FastText | ELMo, BERT, GPT |
| Typical use | Feature initialization, similarity search | End-to-end fine-tuned models |
Embedding layer weights can be either trainable (updated during training via backpropagation) or frozen (held fixed throughout training).
Trainable embeddings are initialized randomly and learned from scratch alongside the rest of the model. This is appropriate when working with a large, task-specific dataset, or when the vocabulary is specialized (for example, molecular structures or game moves) and no suitable pre-trained embeddings exist.
Frozen embeddings use pre-trained vectors that are not updated during training. Freezing is useful for small training datasets to prevent overfitting, and it reduces the number of trainable parameters, speeding up training and lowering memory requirements.
A common hybrid strategy involves two phases. First, freeze the embedding layer and train only the upper layers of the model, so that the randomly initialized classifier head does not corrupt the pre-trained embeddings with large, noisy gradients. Second, unfreeze the embedding layer and fine-tune the entire model with a low learning rate. This transfer learning approach often outperforms either fully frozen or fully trainable embeddings.
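A minimal PyTorch sketch of the two-phase strategy, assuming a model with an .embedding attribute and a hypothetical train_one_epoch helper:

```python
import torch

# Phase 1: freeze the embedding layer, train the rest of the model
model.embedding.weight.requires_grad = False
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
for _ in range(3):
    train_one_epoch(model, optimizer)

# Phase 2: unfreeze and fine-tune everything with a low learning rate
model.embedding.weight.requires_grad = True
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
for _ in range(2):
    train_one_epoch(model, optimizer)
```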
Semi-frozen embeddings freeze vectors for words that exist in the pre-trained vocabulary while leaving out-of-vocabulary words trainable. This enables learning representations for new terms without disturbing the established embeddings.
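One way to implement semi-frozen embeddings is to zero out gradients for rows that came from the pre-trained vocabulary; a sketch using a gradient hook, where frozen_ids is a hypothetical index tensor:

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10_000, 300)
frozen_ids = torch.tensor([0, 1, 2, 5, 42])   # rows holding pre-trained vectors

mask = torch.ones(emb.num_embeddings, 1)
mask[frozen_ids] = 0.0

# Multiply the gradient by the mask so frozen rows are never updated
emb.weight.register_hook(lambda grad: grad * mask)
```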
Several widely used pre-trained embedding sets can be loaded into embedding layers for strong initialization. Using pre-trained embeddings is a form of transfer learning that encodes knowledge from massive corpora, improving performance and convergence speed, especially with limited task-specific data.
| Method | Developer | Training approach | Key feature |
|---|---|---|---|
| Word2Vec | Google | Skip-gram or CBOW on Google News (~100B words) | Learns from local context windows |
| GloVe | Stanford NLP | Global co-occurrence matrix factorization | Combines global statistics with local context |
| FastText | Meta AI (FAIR) | Skip-gram with character n-grams | Handles out-of-vocabulary words via subword information |
| ELMo | Allen AI | Bidirectional LSTM language model | Context-dependent (dynamic) embeddings |
To use pre-trained embeddings, initialize the embedding layer's weight matrix with the pre-trained vectors. For vocabulary words not present in the pre-trained set, the corresponding rows are typically initialized randomly, matching the variance of the pre-trained vectors to avoid scale mismatches. In PyTorch, nn.Embedding.from_pretrained() handles this process. In TensorFlow/Keras, pass an embeddings_initializer or set weights directly with set_weights().
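A sketch of this initialization, assuming pretrained is a dict mapping words to NumPy vectors and vocab maps words to row indices (both hypothetical):

```python
import numpy as np
import torch
import torch.nn as nn

d = 300
weights = np.zeros((len(vocab), d), dtype=np.float32)
scale = np.std(np.stack(list(pretrained.values())))   # match pre-trained variance

for word, idx in vocab.items():
    if word in pretrained:
        weights[idx] = pretrained[word]
    else:
        weights[idx] = np.random.normal(0.0, scale, size=d)

emb = nn.Embedding.from_pretrained(torch.from_numpy(weights), freeze=False)
```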
In NLP, the embedding layer is typically the first component in the model. It converts sequences of token IDs (produced by a tokenizer) into sequences of dense vectors. In transformer models like BERT and GPT, the embedding layer output is combined with positional encodings before being fed into self-attention layers.
Modern language models use subword tokenization schemes (such as byte pair encoding, WordPiece, or SentencePiece) that break words into smaller units. The embedding layer then maps each subword token to a vector. This approach limits vocabulary size to a manageable range (typically 30,000 to 50,000 tokens for monolingual English models) while ensuring that any input text can be tokenized without encountering truly unknown tokens. GPT-2 uses byte-level BPE with a vocabulary of 50,257 tokens, while BERT uses WordPiece with 30,522 tokens.
In collaborative filtering and neural recommendation models, separate embedding layers map user IDs and item IDs to dense vectors. The predicted relevance of an item for a user is often computed as the dot product (or cosine similarity) of the user and item embeddings. This approach originates from matrix factorization and is now central to deep recommendation architectures such as Neural Collaborative Filtering (NCF), Wide and Deep, and DLRM (Deep Learning Recommendation Model).
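A minimal matrix-factorization-style sketch of this dot-product scoring; the class name and sizes are illustrative:

```python
import torch
import torch.nn as nn

class DotProductRecommender(nn.Module):
    def __init__(self, n_users, n_items, d=64):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, d)
        self.item_emb = nn.Embedding(n_items, d)

    def forward(self, user_ids, item_ids):
        # Predicted relevance = dot product of user and item vectors
        u = self.user_emb(user_ids)    # (batch, d)
        v = self.item_emb(item_ids)    # (batch, d)
        return (u * v).sum(dim=-1)     # (batch,)

model = DotProductRecommender(n_users=1_000_000, n_items=50_000)
scores = model(torch.tensor([3, 8]), torch.tensor([42, 7]))
```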
Learning user and item embeddings jointly captures latent preferences: users with similar taste vectors cluster together in the embedding space, as do items with similar characteristics. The embedding tables in large-scale recommendation models can be enormous; for a service with hundreds of millions of users and items, the embedding tables may contain billions of parameters and require specialized distributed storage.
For tabular datasets with high-cardinality categorical columns (such as zip codes, product IDs, or store identifiers), embedding layers replace traditional one-hot encoding. Each categorical feature receives its own small embedding layer, and the resulting vectors are concatenated with continuous features before being passed to dense layers.
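A sketch of this pattern for two hypothetical categorical columns (zip code and store ID) plus continuous features:

```python
import torch
import torch.nn as nn

class TabularNet(nn.Module):
    def __init__(self, n_zip_codes, n_stores, n_continuous):
        super().__init__()
        self.zip_emb = nn.Embedding(n_zip_codes, 16)   # small dims per feature
        self.store_emb = nn.Embedding(n_stores, 8)
        self.mlp = nn.Sequential(
            nn.Linear(16 + 8 + n_continuous, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, zip_ids, store_ids, continuous):
        # Concatenate embedded categoricals with continuous columns
        x = torch.cat(
            [self.zip_emb(zip_ids), self.store_emb(store_ids), continuous], dim=-1
        )
        return self.mlp(x)
```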
This entity embeddings technique was demonstrated by Guo and Berkhahn (2016), who used it to achieve third place in the Kaggle Rossmann Store Sales competition. They showed that it captures richer relationships between categories than one-hot encoding. For example, embedding zip codes can learn to place geographically nearby or socioeconomically similar regions close together in the vector space, without any explicit geographic information. The learned embeddings can also be extracted and used as input features for other machine learning models like gradient boosted trees, often improving their performance as well.
In multimodal models like CLIP (Contrastive Language-Image Pre-training), separate embedding layers and encoders process different modalities (images and text) into a shared embedding space. A vision encoder (such as a Vision Transformer or ResNet) produces image embeddings, while a text encoder (a transformer-based model) produces text embeddings. Learned projection matrices map both sets of features into a common vector space, typically of 512 dimensions. The model is trained with a contrastive objective that maximizes cosine similarity between matching image-text pairs while minimizing similarity between non-matching pairs. This joint embedding space enables zero-shot image classification, cross-modal search, and other tasks that bridge vision and language.
In graph neural networks, embedding layers produce initial node representations that are refined through message-passing operations. Each node receives an initial embedding (either from a lookup table or from input features), and through multiple rounds of neighborhood aggregation, these embeddings incorporate information about the node's local and global graph structure. In knowledge graphs, embedding methods like TransE, DistMult, and RotatE learn vector representations for both entities and relations, enabling tasks like link prediction and knowledge base completion.
Embeddings produced by embedding layers (or full encoder models) form the basis of dense retrieval systems. Documents and queries are encoded into dense vectors using models like BERT-based bi-encoders, and retrieval is performed via approximate nearest neighbor search using cosine similarity or dot product distance. This approach powers modern semantic search systems and retrieval-augmented generation (RAG) pipelines. Vector databases such as Pinecone, Weaviate, Milvus, and FAISS are specifically designed to store and efficiently search through large collections of embedding vectors. Hybrid retrieval systems combine dense embeddings (for semantic understanding) with sparse representations like BM25 or TF-IDF (for exact keyword matching) to achieve the best of both approaches.
In language models, the input embedding layer and the output projection layer (which predicts the next token) both have shape (V, d). Weight tying shares the same weight matrix between these two layers. The input embedding maps tokens to vectors ("what does this token mean when read?"), while the output projection maps hidden states back to vocabulary logits ("what token should be produced?"). Both concern the semantic identity of tokens, making weight sharing a reasonable inductive bias.
Weight tying was introduced by Inan et al. (2016) and Press and Wolf (2017) and became standard practice in GPT-2 and many subsequent transformers. The benefits include a substantial reduction in parameter count (the V x d output matrix is no longer stored separately), a regularization effect from forcing both roles to share one representation, and improved language modeling perplexity reported in the original papers.
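A minimal sketch of how tying is typically expressed in PyTorch; this is a toy decoder head, not any specific model's code:

```python
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying: the output projection reuses the embedding matrix (V, d)
        self.lm_head.weight = self.tok_emb.weight

    def forward(self, hidden_states):
        return self.lm_head(hidden_states)   # (batch, seq, vocab_size) logits
```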
Recent research (2025) has identified a trade-off: tied embeddings can become biased toward the output (prediction) task because output gradients tend to dominate during early training, potentially reducing the effectiveness of input representations in early layers. Press and Wolf (2017) also showed that the tied embedding evolves in a manner more similar to the output embedding than to the input embedding in an untied model. As a result, some newer large language models (such as Llama 2 and Llama 3) have moved away from weight tying, using separate input and output embedding matrices despite the increased parameter count.
Embedding layers often need to handle padding tokens and other special tokens that carry no semantic meaning but serve structural purposes in batched computation.
In PyTorch, the padding_idx parameter designates a specific index (typically 0) whose embedding vector is fixed at all zeros and does not receive gradient updates during training. This is necessary when processing variable-length sequences that are padded to a uniform length within a batch. The padding token's embedding remains a zero vector throughout training, ensuring it does not contribute meaningfully to downstream computations like attention or pooling.
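For example, with padding_idx set, the padding row stays at zero and is excluded from gradient updates:

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10_000, 256, padding_idx=0)

batch = torch.tensor([[5, 42, 7, 0, 0]])   # sequence padded with index 0
out = emb(batch)

print(out[0, 3])               # all zeros: the padding vector
print(emb.weight[0].sum())     # stays at zero throughout training
```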
In Keras, the equivalent mask_zero=True parameter tells subsequent layers (particularly recurrent layers) to ignore positions where the input is 0. When this flag is set, the vocabulary's input_dim must be incremented by 1 to account for the reserved zero index.
When a word or item is not present in the embedding layer's vocabulary, it is classified as out-of-vocabulary (OOV). Common handling strategies include mapping all unknown items to a shared UNK token with its own trainable embedding, using subword tokenization so that any string decomposes into known units, composing vectors from character n-grams as FastText does, and hashing rare categories into a fixed number of buckets.
Transformer self-attention is permutation-invariant, so the model has no built-in sense of token order the way recurrent neural networks do. To encode position information, transformers add positional encodings to the token embeddings before the attention layers. Several approaches exist.
The original transformer (Vaswani et al., 2017) used fixed sinusoidal functions of different frequencies to encode positions. Each dimension of the positional encoding vector uses a sine or cosine function with a specific frequency, creating a unique pattern for each position. This approach requires no additional learned parameters and can theoretically generalize to sequence lengths not seen during training.
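A compact sketch of the sinusoidal encoding from the original paper (assumes an even d_model):

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...)."""
    position = torch.arange(max_len).unsqueeze(1).float()            # (max_len, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe   # (max_len, d_model), added to the token embeddings
```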
Models like BERT and GPT-2 use a second embedding layer that maps position indices (0, 1, 2, ..., max_length - 1) to vectors of the same dimension as the token embeddings. These positional embeddings are learned during training rather than fixed, and are added element-wise to the token embeddings. The drawback is that learned positional embeddings cannot handle sequences longer than the maximum length seen during training without additional modifications.
Introduced by Su et al. (2021), RoPE encodes position by rotating query and key vectors in paired dimensions by an angle proportional to their absolute positions. After rotation, the dot product between a query and key naturally encodes only the relative distance between the two tokens, without adding any learnable parameters. RoPE has become the default positional strategy in many modern language models, including Llama 2, Llama 3, Gemma, Mistral, and Qwen.
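A minimal sketch of RoPE using the "rotate half" formulation found in many open-source implementations; the shapes and the base value are common defaults assumed for illustration, not taken from any particular model's code:

```python
import torch

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(x, positions, base=10000.0):
    """Rotate query or key vectors by position-dependent angles.

    x:         (seq_len, d) with d even
    positions: (seq_len,) absolute token positions
    """
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))   # (d/2,)
    angles = positions.float()[:, None] * inv_freq[None, :]          # (seq_len, d/2)
    cos = torch.cat((angles.cos(), angles.cos()), dim=-1)            # (seq_len, d)
    sin = torch.cat((angles.sin(), angles.sin()), dim=-1)
    return x * cos + rotate_half(x) * sin
```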
RoPE is parameter-free and inherently relative, and it scales gracefully from short to long contexts. Extensions like NTK-aware scaling and YaRN allow models to generalize to sequence lengths substantially longer than those seen during training. Recent work (2025) has explored generalizing RoPE to higher-dimensional spaces using Lie algebraic formulations for applications in video, spatial data, and spherical coordinates.
PyTorch provides the torch.nn.Embedding module with this constructor signature:
```python
torch.nn.Embedding(
    num_embeddings,            # vocabulary size (V)
    embedding_dim,             # embedding vector dimension (d)
    padding_idx=None,          # index whose embedding stays zero
    max_norm=None,             # renormalize embeddings exceeding this norm
    norm_type=2.0,             # norm type for max_norm
    scale_grad_by_freq=False,  # scale gradients by inverse frequency
    sparse=False,              # use sparse gradient updates
)
```
Basic usage:
```python
import torch
import torch.nn as nn

# Create embedding layer: 10,000 tokens, 256-dimensional vectors
emb = nn.Embedding(num_embeddings=10000, embedding_dim=256)

# Look up embeddings for a batch of token IDs
input_ids = torch.tensor([5, 42, 7, 103])
vectors = emb(input_ids)  # shape: (4, 256)
```
Loading pre-trained weights:
```python
# Assume pretrained_weights is a FloatTensor of shape (V, d)
emb = nn.Embedding.from_pretrained(pretrained_weights, freeze=True)
```
Setting freeze=True makes the weights non-trainable, which is equivalent to emb.weight.requires_grad = False.
In TensorFlow, the equivalent layer is tf.keras.layers.Embedding:
```python
tf.keras.layers.Embedding(
    input_dim,                 # vocabulary size (V)
    output_dim,                # embedding dimension (d)
    embeddings_initializer='uniform',
    embeddings_regularizer=None,
    embeddings_constraint=None,
    mask_zero=False,
    input_length=None,
)
```
Basic usage:
```python
import tensorflow as tf

# Create embedding layer
emb = tf.keras.layers.Embedding(input_dim=10000, output_dim=256)

# Look up embeddings
input_ids = tf.constant([5, 42, 7, 103])
vectors = emb(input_ids)  # shape: (4, 256)
```
To freeze the layer, set emb.trainable = False. To load pre-trained weights, use emb.set_weights([pretrained_matrix]) or pass a custom initializer.
| Feature | PyTorch (nn.Embedding) | TensorFlow (tf.keras.layers.Embedding) |
|---|---|---|
| Padding support | padding_idx parameter | mask_zero=True |
| Norm constraint | max_norm parameter | embeddings_constraint |
| Sparse gradients | sparse=True | Not built-in (use tf.IndexedSlices) |
| Pre-trained loading | from_pretrained() class method | set_weights() or custom initializer |
| Freeze weights | freeze parameter or requires_grad | trainable = False |
How the embedding weight matrix is initialized before training affects convergence speed and final model performance.
Random initialization is the default in most frameworks. PyTorch initializes nn.Embedding weights from a standard normal distribution N(0, 1). Keras uses a uniform distribution by default. Random initialization works well for most tasks when training from scratch with sufficient data.
Pre-trained initialization loads vectors from Word2Vec, GloVe, FastText, or another source. For tokens not covered in the pre-trained set, the corresponding rows are typically initialized randomly with variance matching the pre-trained vectors to avoid scale mismatches. Research on transformer models has shown that standardizing pre-trained embeddings to a range consistent with Xavier initialization can improve downstream task performance.
Xavier (Glorot) initialization samples weights to preserve activation variance across layers. Originally designed for layers with symmetric activations like sigmoid or tanh, it is sometimes applied to embedding layers in transformer architectures.
Scaled initialization appears in some large language models. GPT-2 initializes embeddings from a normal distribution with standard deviation 0.02. Some models scale by 1/sqrt(d), where d is the embedding dimension. The Hugging Face Transformers library typically uses an initializer_range configuration parameter (often defaulting to 0.02) for this purpose. These choices help stabilize training in very deep networks by keeping initial activations in a reasonable range.
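For instance, a GPT-2-style scaled initialization can be reproduced as follows; the 0.02 standard deviation comes from the text above, while the layer sizes are illustrative:

```python
import torch.nn as nn

emb = nn.Embedding(50_257, 768)
nn.init.normal_(emb.weight, mean=0.0, std=0.02)   # GPT-2-style scaled init
```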
In practice, initialization matters most when training data is limited. With large datasets and many training epochs, models tend to learn effective embeddings regardless of starting points. But for small datasets or scenarios using pre-trained embeddings, careful initialization combined with frozen or slowly fine-tuned embedding layers can significantly impact performance.
Imagine you have a big book of stickers, where each page has a number. When someone tells you a number, you open to that page and pull out the sticker. Each sticker has special colors and shapes that tell you something about what that number represents.
An embedding layer works in a similar way. It has a big chart (called the weight matrix), and each row in that chart is a list of numbers (a vector). When the computer receives a word or item represented by a number, the embedding layer looks up that row in the chart and hands back the list of numbers. Those numbers carry information about what the word means and how it relates to other words.
The interesting part is that the computer learns what numbers to put in each row by practicing on lots of examples. Over time, words that mean similar things end up with similar rows in the chart. So "happy" and "joyful" get stickers that look almost the same, while "happy" and "refrigerator" get very different stickers.