# Embedding Layer

> Source: https://aiwiki.ai/wiki/embedding_layer
> Updated: 2026-07-11
> Categories: Machine Learning, Natural Language Processing, Neural Networks
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

An **embedding layer** is a [neural network](/wiki/neural_network) component that acts as a trainable lookup table, mapping discrete integer indices (such as word IDs, user IDs, or category codes) to dense, continuous-valued vectors. Rather than using sparse, high-dimensional [one-hot encodings](/wiki/one-hot_encoding), the embedding layer stores a weight matrix of shape *(vocabulary_size, embedding_dimension)* and retrieves the appropriate row for each input index. It is the first layer of nearly every modern [large language model](/wiki/large_language_model) and recommender system: in GPT-3, for example, the token embedding layer alone holds roughly 617 million parameters (a 50,257 x 12,288 matrix).[16] The lookup operation is mathematically equivalent to multiplying a one-hot vector by a weight matrix, but the lookup-based implementation avoids the wasteful zero multiplications that come with large vocabularies.

Embedding layers are foundational in modern [deep learning](/wiki/deep_learning) systems. They appear in [natural language processing](/wiki/natural_language_understanding) models (converting tokens to vectors), [recommendation systems](/wiki/recommender_system) (representing users and items), and tabular data pipelines (encoding high-cardinality categorical features). The concept of learned distributed representations was introduced by Bengio et al. in their 2003 neural probabilistic [language model](/wiki/language_model)[1] and later popularized by standalone word [embedding](/wiki/word_embedding) methods such as Word2Vec[2] and GloVe.[4]

## What problem does an embedding layer solve?

The embedding layer exists to defeat what Bengio et al. (2003) called the "curse of dimensionality" in language modeling: with a vocabulary of 100,000 words, modeling a joint distribution over 10 consecutive words involves on the order of 100,000^10 possible sequences, almost none of which appear in any training corpus.[1] One-hot encoding makes this worse by assigning every word an orthogonal vector, so the representation carries no notion of similarity (the vectors for "cat" and "dog" are exactly as far apart as "cat" and "refrigerator"). The embedding layer replaces these sparse, equidistant vectors with a small dense vector per item, learned so that similar items land near each other in vector space. Bengio et al. proposed "to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences."[1]

## Historical background

The idea of representing words as dense vectors has roots stretching back to the 1980s, when Hinton (1986) introduced the concept of "distributed representations." However, the modern embedding layer traces its lineage through several key milestones.

### Bengio's neural probabilistic language model (2003)

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin published "A Neural Probabilistic Language Model" in the Journal of Machine Learning Research in 2003.[1] This paper introduced the idea of learning word embeddings jointly with a language modeling objective. The model used a matrix $$C$$ of dimensions $$\lvert V \rvert \times m$$, where each row mapped a word to a real-valued vector in $$\mathbb{R}^m$$. These word vectors were concatenated and fed through a feedforward neural network with a tanh hidden layer, and a [softmax](/wiki/softmax) output layer predicted the next word. The probability function took the form $$y = \mathrm{softmax}(b + Wx + U \tanh(d + Hx))$$, where $$x$$ was the concatenation of the context word embeddings, $$H$$ and $$d$$ were the hidden-layer weights and bias, $$U$$ the hidden-to-output weights, $$W$$ optional direct input-to-output weights, and $$b$$ the output bias. The model learned both the embedding matrix $$C$$ and the network parameters simultaneously through [backpropagation](/wiki/backpropagation).[1] This paper laid the groundwork for much of the subsequent word embedding research.
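
As a rough illustration of the forward pass described above, the following NumPy sketch evaluates $$y = \mathrm{softmax}(b + Wx + U \tanh(d + Hx))$$ for a single context. The layer sizes, the random initialization, and the function name `nplm_forward` are assumptions made for the example, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
V, m, n_ctx, h = 1000, 30, 3, 50         # vocab size, embedding dim, context words, hidden units

C = rng.normal(0, 0.1, (V, m))           # embedding matrix, one row per word
H = rng.normal(0, 0.1, (h, n_ctx * m))   # input-to-hidden weights
d = np.zeros(h)                          # hidden bias
U = rng.normal(0, 0.1, (V, h))           # hidden-to-output weights
W = rng.normal(0, 0.1, (V, n_ctx * m))   # direct input-to-output weights
b = np.zeros(V)                          # output bias

def nplm_forward(context_ids):
    """Return next-word probabilities for a list of n_ctx context word IDs."""
    x = C[context_ids].reshape(-1)       # look up and concatenate the embeddings
    logits = b + W @ x + U @ np.tanh(d + H @ x)
    logits -= logits.max()               # stabilize the softmax numerically
    exp = np.exp(logits)
    return exp / exp.sum()

probs = nplm_forward([12, 7, 431])       # one probability per vocabulary word
```

In a real model the matrices would be trained by backpropagation; here they stay random, so the output is only a shape and normalization check.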

### Word2Vec (2013)

Tomas Mikolov and colleagues at Google published "Efficient Estimation of Word Representations in Vector Space" in 2013, introducing two lightweight architectures that could train on billions of words in hours rather than weeks.[2]

- **Skip-gram** predicts surrounding context words given a center word. For a center word $$w_c$$, the model computes $$P(w_o \mid w_c) = \frac{\exp(u_o^\top v_c)}{\sum_i \exp(u_i^\top v_c)}$$, where $$v_c$$ is the center word vector and $$u_o$$ is the context word vector. Each word maintains two separate vectors: one used when the word appears as a center word, and another when it appears as context.
- **Continuous bag-of-words (CBOW)** predicts a center word from the average of its surrounding context word vectors. It computes $$P(w_c \mid W_o)$$, where $$W_o$$ is the set of context words, using the vector averaged across the $$2m$$ surrounding words ($$m$$ on each side of the center word).

Both models used training optimizations to scale to large vocabularies. **Negative sampling** pairs each positive training example with $$k$$ randomly sampled "negative" words and updates only those embeddings, reducing computation from $$O(V)$$ to $$O(k)$$ per step. **Hierarchical softmax** uses a binary Huffman tree to represent the output layer, evaluating only about $$\log_2(V)$$ nodes instead of all $$V$$.[3]
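
The cost difference between the full skip-gram softmax and negative sampling can be sketched in NumPy as follows. This is a minimal illustration under assumed sizes and random vectors; in particular, negatives are drawn uniformly here for simplicity, whereas Word2Vec samples them from a smoothed unigram distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 5000, 64, 5
v = rng.normal(0, 0.1, (V, d))    # center-word vectors
u = rng.normal(0, 0.1, (V, d))    # context-word vectors

def full_softmax_prob(center, context):
    """P(context | center) with the full softmax: touches all V context vectors."""
    scores = u @ v[center]
    scores -= scores.max()
    exp = np.exp(scores)
    return exp[context] / exp.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(center, context, negatives):
    """Loss for one positive pair plus k negatives: touches only k + 1 context vectors."""
    pos = np.log(sigmoid(u[context] @ v[center]))
    neg = np.log(sigmoid(-(u[negatives] @ v[center]))).sum()
    return -(pos + neg)

negatives = rng.integers(0, V, size=k)   # uniform sampling, an assumption for illustration
p = full_softmax_prob(center=10, context=20)
loss = negative_sampling_loss(center=10, context=20, negatives=negatives)
```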

A follow-up paper by Mikolov et al. (2013b), "Distributed Representations of Words and Phrases and their Compositionality," extended the approach to learn embeddings for common phrases and further refined the negative sampling objective.[3]

### GloVe (2014)

Jeffrey Pennington, Richard Socher, and Christopher Manning at Stanford developed GloVe (Global Vectors for Word Representation), which combined the strengths of global matrix factorization methods and local context window methods. GloVe first builds a global word-word co-occurrence matrix from the entire corpus in a single pass, then trains embeddings by minimizing a weighted least-squares objective that fits the dot product of word vectors to the logarithm of their co-occurrence count.[4] The weighting function caps the weight given to very frequent co-occurrences, so that common word pairs do not dominate the objective, and assigns smaller weights to rare, noisy co-occurrences. GloVe can train faster on large corpora because the expensive co-occurrence counting happens only once upfront; training then iterates over the nonzero entries of the matrix rather than over the raw text.
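
A minimal NumPy sketch of this weighted least-squares objective is shown below. The weighting function $$f(x) = (x / x_{\max})^{\alpha}$$ capped at 1, with $$x_{\max} = 100$$ and $$\alpha = 0.75$$, and the per-word bias terms follow the GloVe paper; the toy co-occurrence matrix and the function names are assumptions for the example.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weight that grows with the count and is capped at 1 for counts >= x_max."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W_ctx, b, b_ctx):
    """Sum of weighted squared errors over the nonzero co-occurrence counts."""
    i, j = np.nonzero(X)                                   # observed pairs only
    x = X[i, j]
    pred = np.sum(W[i] * W_ctx[j], axis=1) + b[i] + b_ctx[j]
    return np.sum(glove_weight(x) * (pred - np.log(x)) ** 2)

rng = np.random.default_rng(0)
V, d = 6, 4
X = rng.integers(0, 50, size=(V, V)).astype(float)         # toy co-occurrence counts
W, W_ctx = rng.normal(0, 0.1, (V, d)), rng.normal(0, 0.1, (V, d))
b, b_ctx = np.zeros(V), np.zeros(V)
loss = glove_loss(X, W, W_ctx, b, b_ctx)
```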

### FastText (2017)

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov at Facebook AI Research introduced FastText, which extended the skip-gram model by representing each word as a bag of character n-grams (typically 3 to 6 characters long).[5] The embedding for a word is computed as the sum of its constituent n-gram embeddings. For example, the word "running" is decomposed into character substrings like "<ru", "run", "unn", "nni", "nin", "ing", "ng>" (with boundary markers), and its vector is the sum of these n-gram vectors. This approach handles out-of-vocabulary words (by summing their n-gram vectors), captures morphological patterns (words sharing suffixes or prefixes get similar embeddings), and produces better representations for rare words.[5]
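
The decomposition above can be sketched in a few lines of Python. The helper names `char_ngrams` and `word_vector`, the vector size, and the lazily filled dictionary of random n-gram vectors are assumptions for illustration; FastText itself stores n-gram vectors in a hashed table and learns them during training.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]

rng = np.random.default_rng(0)
dim = 8
ngram_vectors = {}   # hypothetical n-gram table, filled with random vectors on first use

def word_vector(word):
    """Sum the vectors of all character n-grams of the word."""
    grams = char_ngrams(word)
    for g in grams:
        ngram_vectors.setdefault(g, rng.normal(0, 0.1, dim))
    return np.sum([ngram_vectors[g] for g in grams], axis=0)

trigrams = char_ngrams("running", n_min=3, n_max=3)
vec = word_vector("runnings")   # an unseen word still gets a vector from its n-grams
```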

### Timeline of embedding methods

| Method | Year | Developer | Training approach | Handles OOV words | Embedding type |
|---|---|---|---|---|---|
| Bengio NPLM | 2003 | Université de Montréal | Feedforward network with [softmax](/wiki/softmax) | No | Static |
| Word2Vec | 2013 | Google | Skip-gram or CBOW with negative sampling | No | Static |
| GloVe | 2014 | Stanford NLP | Global co-occurrence matrix factorization | No | Static |
| FastText | 2017 | Meta AI (FAIR) | Skip-gram with character n-grams | Yes | Static |
| ELMo | 2018 | Allen AI | Bidirectional [LSTM](/wiki/long_short-term_memory_lstm) language model | Yes (character-based input) | Contextual |
| [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers) | 2018 | Google | Masked language model with [transformer](/wiki/transformer) | Partially (unseen words are split into WordPiece subwords) | Contextual |

## How does an embedding layer work?

An embedding layer maintains a two-dimensional weight matrix **W** with shape $$(V, d)$$, where $$V$$ is the vocabulary size (the number of distinct items) and $$d$$ is the embedding dimension (the number of values in each vector). When receiving an integer input *i*, the layer returns the *i*-th row of **W**. For a batch of inputs, it returns the corresponding stacked rows.

The forward pass is a pure index-based lookup with no matrix multiplication, making it computationally inexpensive. Given a sequence of [token](/wiki/token) IDs [3, 17, 42], the embedding layer simply fetches rows 3, 17, and 42 from the weight matrix and returns them as a tensor of shape (3, d).
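
The lookup can be reproduced with plain array indexing. The sketch below uses NumPy rather than a deep learning framework, and the sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 100, 8
W = rng.normal(0, 0.02, (V, d))        # embedding weight matrix

token_ids = np.array([3, 17, 42])
vectors = W[token_ids]                 # rows 3, 17, 42 -> shape (3, d)

batch_ids = np.array([[3, 17, 42],
                      [5, 5, 0]])
batch_vectors = W[batch_ids]           # one row per ID -> shape (2, 3, d)
```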

### How does an embedding layer differ from one-hot encoding?

Mathematically, the embedding lookup for index *i* is equivalent to computing $$x_{\text{onehot}} \cdot W$$, where $$x_{\text{onehot}}$$ is a one-hot vector with a 1 at position *i* and zeros elsewhere. Since multiplying a one-hot vector by a matrix simply selects one row, the result is identical. However, the embedding layer is far more efficient in practice because it avoids constructing the sparse one-hot vector and performing the full matrix multiplication. One-hot encoding a vocabulary of 100,000 tokens would require creating vectors with 100,000 elements (all but one of which are zeros) and then performing 99,999 multiplications by zero for each of the $$d$$ output dimensions. The embedding layer skips all of this by going directly to the relevant row.

| Aspect | One-hot + linear layer | Embedding layer (lookup) |
|---|---|---|
| Input representation | Sparse vector of size *V* | Single integer index |
| Forward pass operation | Full matrix multiplication | Row index lookup |
| Memory for input | O(V) per token | O(1) per token |
| Gradient computation | Dense gradient over full matrix | Sparse gradient (only accessed rows) |
| Bias term | Can include bias | Typically no bias |
| Mathematical result | Identical | Identical |
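
The equivalence in the last row of the table can be checked numerically. This is a small NumPy sketch with assumed sizes; it also counts how many of the multiplications in the one-hot route involve a zero.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 16
W = rng.normal(size=(V, d))

i = 42
one_hot = np.zeros(V)
one_hot[i] = 1.0

via_matmul = one_hot @ W               # V * d multiplications
via_lookup = W[i]                      # no arithmetic, just a row read
same = np.allclose(via_matmul, via_lookup)

total_mults = V * d
zero_mults = (V - 1) * d               # every product except those in row i
```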

### Gradient updates and training

During [backpropagation](/wiki/backpropagation), only the rows of the embedding matrix that were accessed in the current batch receive non-zero gradients. This sparse gradient property provides a significant efficiency advantage. If a vocabulary contains 50,000 entries but only 128 unique tokens appear in a given batch, only 128 rows of the embedding matrix receive a non-zero gradient in that training step. The remaining 49,872 rows are left unchanged by plain gradient descent, although optimizers with momentum or weight decay may still modify rows that were updated in earlier steps.

Frameworks like PyTorch support explicitly sparse gradient tensors (via the `sparse=True` parameter) that store only the non-zero entries, further reducing memory consumption during training.[14] The [optimizer](/wiki/optimizer) can then apply updates only to the accessed rows, provided it supports sparse gradients: in PyTorch this includes [stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd), Adagrad, and SparseAdam, but not the standard dense Adam.
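
The row-sparse update can be imitated by hand. The following NumPy sketch applies one plain SGD step; the batch, the stand-in gradients, and the learning rate are assumptions, and a framework would compute the gradients through backpropagation instead.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, lr = 50_000, 16, 0.1
W = rng.normal(0, 0.02, (V, d))
W_before = W.copy()

token_ids = rng.integers(0, V, size=256)       # a batch of token IDs (repeats allowed)
grad_out = rng.normal(size=(256, d))           # stand-in gradient for each looked-up vector

# Accumulate gradients per unique row; a token that appears twice sums both gradients.
rows, inverse = np.unique(token_ids, return_inverse=True)
row_grads = np.zeros((len(rows), d))
np.add.at(row_grads, inverse, grad_out)

W[rows] -= lr * row_grads                      # plain SGD touches only the accessed rows

changed = np.flatnonzero(np.any(W != W_before, axis=1))
```

Only `len(rows)` of the 50,000 rows are written, which is the saving that sparse gradient tensors make explicit.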

### Embedding scaling in transformers

In the original [transformer](/wiki/transformer) architecture (Vaswani et al., 2017), the embedding vectors are multiplied by the square root of the embedding dimension ($$\sqrt{d_{\text{model}}}$$) before being added to positional encodings.[9] The paper does not state a reason for this factor. A common explanation is that it prevents the token embeddings from being dwarfed by the positional encoding values, which use sine and cosine functions with values in the range $$[-1, 1]$$: if the embedding weights are initialized with a small variance, the unscaled vectors would be much smaller than the positional encodings, causing the model to rely too heavily on position information and too little on token identity. When weight tying is used, the same factor is also said to compensate for the shared matrix serving two different purposes, as an input lookup table and as an output projection.
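
The magnitude argument can be made concrete with a small NumPy sketch. It assumes embedding weights initialized with standard deviation $$1/\sqrt{d_{\text{model}}}$$, which is one common choice and not something the source specifies; under a different initialization the numbers change.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_model = 1000, 512
W = rng.normal(0, d_model ** -0.5, (V, d_model))      # assumed init: std = 1/sqrt(d_model)

token_ids = np.array([3, 17, 42])
raw = W[token_ids]
scaled = raw * np.sqrt(d_model)                       # the transformer's scaling step

raw_norm = np.linalg.norm(raw, axis=1).mean()         # close to 1 under this init
scaled_norm = np.linalg.norm(scaled, axis=1).mean()   # close to sqrt(d_model)

# Each sine/cosine pair contributes sin^2 + cos^2 = 1, so every sinusoidal
# positional encoding vector has norm sqrt(d_model / 2).
pe_norm = np.sqrt(d_model / 2)
```

With these assumptions the unscaled embeddings are much shorter than a positional encoding vector, while the scaled ones have a comparable length.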

## How do you choose the embedding dimension?

The embedding dimension *d* controls how many floating-point numbers represent each input category. Choosing the right dimension balances expressiveness against computational cost and [overfitting](/wiki/overfitting) risk.

### Common guidelines

| Rule of thumb | Formula or range | Typical use case |
|---|---|---|
| Square root rule | $$d = \sqrt{V}$$ | General starting point |
| Fourth root rule (Google) | $$d = V^{1/4}$$ | Text classification, tabular data |
| Powers of two | d = 32, 64, 128, 256, 512 | GPU-optimized architectures |
| Industry convention for NLP | d = 50 to 300 | Word embeddings (Word2Vec, GloVe) |
| [Large language models](/wiki/large_language_model) | d = 768 to 12,288 | [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers), [GPT](/wiki/gpt_generative_pre-trained_transformer) families |

Smaller dimensions compress information more aggressively and train faster, but may fail to capture nuanced relationships between items. Larger dimensions can represent subtler distinctions but require more data to avoid overfitting and consume more memory and compute.
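
The first two rules of thumb in the table are easy to compute. In the sketch below the function name `suggested_dims` is hypothetical, and rounding the fourth-root value up to a power of two is just one way to combine two of the guidelines, not an established rule.

```python
import math

def suggested_dims(vocab_size):
    """Return starting-point embedding dimensions for a given vocabulary size."""
    sqrt_rule = round(math.sqrt(vocab_size))
    fourth_root_rule = round(vocab_size ** 0.25)
    # Assumption: round the fourth-root value up to the next power of two.
    next_power_of_two = 2 ** math.ceil(math.log2(fourth_root_rule))
    return {
        "sqrt_rule": sqrt_rule,
        "fourth_root_rule": fourth_root_rule,
        "next_power_of_two": next_power_of_two,
    }

dims = suggested_dims(10_000)
```

These values are only starting points; the final dimension is usually chosen by validation performance.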

As a rough heuristic, datasets with fewer than 100,000 sentences generally benefit from lower dimensions (50 to 100), while large-scale corpora support dimensions of 300 or higher. For large language models with billions of parameters, embedding dimensions of 4,096 or more are standard. GPT-3 uses an embedding dimension of 12,288, while BERT-base uses 768 and BERT-large uses 1,024.[16]

### Learnable embedding sizes

Recent research has explored automatically learning the optimal embedding dimension for different features, particularly in [recommendation systems](/wiki/recommender_system). Instead of assigning the same dimension to all categorical features, these methods allocate larger dimensions to features with more unique values or more complex relationships, and smaller dimensions to simpler features. This mixed-dimension approach can reduce model size while maintaining or improving accuracy.

## What is the difference between static and contextual embeddings?

Embedding layers can produce either static or contextual representations. Understanding this distinction is important for selecting the right approach for a given task.

### Static embeddings

Methods like Word2Vec, GloVe, and FastText assign each word a single, fixed vector regardless of context. The word "bank" receives the same vector whether it appears in "river bank" or "bank account," even though the meanings differ. Static embeddings are stored in a simple lookup table and are computationally cheap to retrieve. They work well as feature initializations for downstream models and for tasks where context sensitivity is less important.

### Contextual embeddings

Models like ELMo, [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers), and [GPT](/wiki/gpt_generative_pre-trained_transformer) generate word representations dynamically based on the surrounding text. In these architectures, the initial embedding layer still performs a standard lookup, but the resulting vectors are then processed through multiple [self-attention](/wiki/self-attention_also_called_self-attention_layer) or recurrent layers that modify each vector based on the entire input sequence. The word "bank" ends up with different final representations in "river bank" vs. "bank account."

ELMo (Peters et al., 2018) was one of the first widely adopted contextual embedding models. It used a bidirectional [LSTM](/wiki/long_short-term_memory_lstm) language model and produced context-dependent vectors by combining representations from different LSTM layers. BERT (Devlin et al., 2018) went further by using a [transformer](/wiki/transformer) encoder trained with a masked language modeling objective, allowing it to incorporate context from both directions simultaneously.

Research by Ethayarajh (2019) showed that, on average, less than 5% of the variance in a word's contextual representations (from ELMo, BERT, or GPT-2) can be explained by a static embedding.[10] This finding indicates that contextual models capture substantially more information about word usage than static approaches.

| Property | Static embeddings | Contextual embeddings |
|---|---|---|
| Vector per word | One fixed vector | Different vector per context |
| Polysemy handling | Cannot disambiguate | Naturally disambiguates |
| Computation cost | Single lookup | Full forward pass through network |
| Storage | One matrix ($$V \times d$$) | Model weights (millions to billions of parameters) |
| Examples | Word2Vec, GloVe, FastText | ELMo, BERT, GPT |
| Typical use | Feature initialization, similarity search | End-to-end fine-tuned models |

## Trainable vs. frozen embeddings

Embedding layer weights can be either **trainable** (updated during training via [backpropagation](/wiki/backpropagation)) or **frozen** (held fixed throughout training).

**Trainable embeddings** are initialized randomly and learned from scratch alongside the rest of the model. This is appropriate when working with a large, task-specific dataset, or when the vocabulary is specialized (for example, molecular structures or game moves) and no suitable pre-trained embeddings exist.

**Frozen embeddings** use pre-trained vectors that are not updated during training. Freezing is useful for small training datasets to prevent overfitting, and it reduces the number of trainable parameters, speeding up training and lowering memory requirements.

A common hybrid strategy involves two phases. First, freeze the embedding layer and train only the upper layers of the model, so that the randomly initialized classifier head does not corrupt the pre-trained embeddings with large, noisy gradients. Second, unfreeze the embedding layer and [fine-tune](/wiki/fine_tuning) the entire model with a low [learning rate](/wiki/learning_rate). This [transfer learning](/wiki/transfer_learning) approach often outperforms either fully frozen or fully trainable embeddings.

**Semi-frozen embeddings** freeze vectors for words that exist in the pre-trained vocabulary while leaving out-of-vocabulary words trainable. This enables learning representations for new terms without disturbing the established embeddings.
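
Semi-frozen behavior can be sketched as a masked update. The NumPy example below assumes a hypothetical split in which rows 0 to 6 came from a pre-trained set and rows 7 to 9 are new words, and it uses a random stand-in for the gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, lr = 10, 4, 0.1
W = rng.normal(size=(V, d))
W_before = W.copy()

# Hypothetical split: rows 0-6 are pre-trained (frozen), rows 7-9 are new (trainable).
trainable = np.zeros(V, dtype=bool)
trainable[7:] = True

grad = rng.normal(size=(V, d))            # stand-in for a gradient from backpropagation
W -= lr * grad * trainable[:, None]       # masked SGD step: frozen rows get a zero update
```

Setting the whole mask to `False` gives a fully frozen layer, and setting it to `True` gives a fully trainable one.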

## Pre-trained embeddings

Several widely used pre-trained embedding sets can be loaded into embedding layers for strong initialization. Using pre-trained embeddings is a form of [transfer learning](/wiki/transfer_learning) that encodes knowledge from massive corpora, improving performance and convergence speed, especially with limited task-specific data.

| Method | Developer | Training approach | Key feature |
|---|---|---|---|
| Word2Vec | Google | Skip-gram or CBOW on Google News (~100B words) | Learns from local context windows |
| GloVe | Stanford NLP | Global co-occurrence matrix factorization | Combines global statistics with local context |
| FastText | Meta AI (FAIR) | Skip-gram with character n-grams | Handles out-of-vocabulary words via subword information |
| ELMo | Allen AI | Bidirectional [LSTM](/wiki/long_short-term_memory_lstm) language model | Context-dependent (dynamic) embeddings |

To use pre-trained embeddings, initialize the embedding layer's weight matrix with the pre-trained vectors. For vocabulary words not present in the pre-trained set, the corresponding rows are typically initialized randomly, matching the variance of the pre-trained vectors to avoid scale mismatches. In PyTorch, `nn.Embedding.from_pretrained()` loads the assembled matrix into a layer.[14] In TensorFlow/Keras, pass an `embeddings_initializer` or set weights directly with `set_weights()`.[15]
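
Assembling the initial weight matrix might look like the following NumPy sketch. The tiny `pretrained` dictionary and the vocabulary are made-up placeholders; real vectors would be read from a Word2Vec, GloVe, or FastText file.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Hypothetical pre-trained vectors (placeholders for values loaded from disk).
pretrained = {
    "cat": np.array([0.2, -0.1, 0.4, 0.0]),
    "dog": np.array([0.3, -0.2, 0.5, 0.1]),
    "fish": np.array([-0.4, 0.1, 0.0, 0.3]),
}
vocab = ["cat", "dog", "axolotl", "fish"]      # "axolotl" has no pre-trained vector

stacked = np.stack(list(pretrained.values()))
mean, std = stacked.mean(), stacked.std()      # scale of the pre-trained vectors

weights = np.empty((len(vocab), d))
for idx, word in enumerate(vocab):
    if word in pretrained:
        weights[idx] = pretrained[word]
    else:
        weights[idx] = rng.normal(mean, std, d)   # random row at the pre-trained scale
```

The resulting `weights` array is what would be handed to the framework's embedding layer.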

## Where are embedding layers used?

### Natural language processing

In NLP, the embedding layer is typically the first component in the model. It converts sequences of [token](/wiki/token) IDs (produced by a tokenizer) into sequences of dense vectors. In [transformer](/wiki/transformer) models like [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers) and [GPT](/wiki/gpt_generative_pre-trained_transformer), the embedding layer output is combined with [positional encodings](/wiki/positional_encoding) before being fed into [self-attention](/wiki/self-attention_also_called_self-attention_layer) layers.

Modern language models use subword tokenization schemes (such as byte pair encoding, WordPiece, or SentencePiece) that break words into smaller units. The embedding layer then maps each subword token to a vector. This approach limits vocabulary size to a manageable range (typically 30,000 to 50,000 tokens for monolingual English models) while ensuring that any input text can be tokenized without encountering truly unknown tokens. GPT-2 uses byte-level BPE with a vocabulary of 50,257 tokens, while BERT uses WordPiece with 30,522 tokens.[13]

### Recommendation systems

In [collaborative filtering](/wiki/collaborative_filtering) and neural recommendation models, separate embedding layers map user IDs and item IDs to dense vectors. The predicted relevance of an item for a user is often computed as the dot product (or cosine similarity) of the user and item embeddings. This approach originates from [matrix factorization](/wiki/matrix_factorization)[12] and is now central to deep recommendation architectures such as Neural Collaborative Filtering (NCF), Wide and Deep, and DLRM (Deep Learning Recommendation Model).
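
Dot-product scoring with two embedding tables can be sketched as follows. The table sizes and the helper names `score` and `top_k` are assumptions, and the vectors are random rather than learned, so the ranking itself is meaningless; only the mechanics are shown.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 100, 500, 16
user_emb = rng.normal(0, 0.1, (n_users, d))    # one row per user ID
item_emb = rng.normal(0, 0.1, (n_items, d))    # one row per item ID

def score(user_id, item_ids):
    """Predicted relevance: dot product of the user vector with each item vector."""
    return item_emb[item_ids] @ user_emb[user_id]

def top_k(user_id, k=5):
    """Return the k item IDs with the highest score for this user."""
    scores = item_emb @ user_emb[user_id]
    return np.argsort(-scores)[:k]

recs = top_k(7)
rec_scores = score(7, recs)
```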

Learning user and item embeddings jointly captures latent preferences: users with similar taste vectors cluster together in the embedding space, as do items with similar characteristics. The embedding tables in large-scale recommendation models can be enormous. In Meta's DLRM, the embedding tables hold the overwhelming majority of model parameters: the public MLPerf recommendation benchmark trains a DLRM on a terabyte of click-through data with roughly 100 GB of embedding memory and more than 25 billion parameters, so the tables are too large to fit on a single accelerator and must be sharded across devices.[17] The DLRM authors note that the model "contains up to billions of parameters, unlike other deep learning networks," which makes embedding-table storage and bandwidth the dominant engineering constraint.[17]

### Categorical features in tabular data

For tabular datasets with high-cardinality categorical columns (such as zip codes, product IDs, or store identifiers), embedding layers replace traditional one-hot encoding. Each categorical feature receives its own small embedding layer, and the resulting vectors are concatenated with continuous features before being passed to dense layers.

This **entity embeddings** technique was demonstrated by Guo and Berkhahn (2016), who used it to achieve third place in the Kaggle Rossmann Store Sales competition.[6] They showed that it captures richer relationships between categories than one-hot encoding.[6] For example, embedding zip codes can learn to place geographically nearby or socioeconomically similar regions close together in the vector space, without any explicit geographic information. The learned embeddings can also be extracted and used as input features for other [machine learning](/wiki/machine_learning) models like gradient boosted trees, often improving their performance as well.
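
The per-column lookup and concatenation step can be sketched in NumPy. The column names, cardinalities, and embedding sizes below are hypothetical, and the tables are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical categorical columns: name -> (cardinality, embedding dimension)
cat_specs = {"store_id": (1000, 10), "day_of_week": (7, 3)}
tables = {
    name: rng.normal(0, 0.05, (cardinality, dim))
    for name, (cardinality, dim) in cat_specs.items()
}

def encode_rows(cat_values, continuous):
    """Look up each categorical column and concatenate with the continuous features."""
    parts = [tables[name][ids] for name, ids in cat_values.items()]
    return np.concatenate(parts + [continuous], axis=1)

batch_cats = {"store_id": np.array([3, 998]), "day_of_week": np.array([0, 6])}
batch_cont = np.array([[0.5, 1.2], [0.1, -0.3]])
features = encode_rows(batch_cats, batch_cont)     # shape (2, 10 + 3 + 2)
```

The `features` array is what the dense layers would receive as input.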

### Multimodal learning

In multimodal models like CLIP (Contrastive Language-Image Pre-training), separate embedding layers and encoders process different modalities (images and text) into a shared embedding space. A vision encoder (such as a Vision Transformer or ResNet) produces image embeddings, while a text encoder (a [transformer](/wiki/transformer)-based model) produces text embeddings. Learned projection matrices map both sets of features into a common vector space, typically of 512 dimensions. The model is trained with a contrastive objective that maximizes cosine similarity between matching image-text pairs while minimizing similarity between non-matching pairs. This joint embedding space enables zero-shot image classification, cross-modal search, and other tasks that bridge vision and language.
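
A contrastive objective of this kind can be sketched as a symmetric cross-entropy over a cosine similarity matrix. The NumPy example below is a simplified illustration: the fixed temperature of 0.07 is an assumption (CLIP learns its temperature), and random vectors stand in for encoder outputs.

```python
import numpy as np

def log_softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy where pair i of images and texts is the positive."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature             # cosine similarities, row i = image i
    diag = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()   # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()   # text -> image
    return (loss_i2t + loss_t2i) / 2

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(4, 512))
aligned_text = image_emb + 0.01 * rng.normal(size=(4, 512))      # nearly matching pairs
random_text = rng.normal(size=(4, 512))                          # unrelated pairs

aligned_loss = contrastive_loss(image_emb, aligned_text)
random_loss = contrastive_loss(image_emb, random_text)
```

The loss is low when matching pairs are the most similar entries in their row and column, and high when they are not.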

### Graph neural networks

In graph neural networks, embedding layers produce initial node representations that are refined through message-passing operations. Each node receives an initial embedding (either from a lookup table or from input features), and through multiple rounds of neighborhood aggregation, these embeddings incorporate information about the node's local and global graph structure. In knowledge graphs, embedding methods like TransE, DistMult, and RotatE learn vector representations for both entities and relations, enabling tasks like link prediction and knowledge base completion.

### Dense retrieval and vector databases

Embeddings produced by embedding layers (or full encoder models) form the basis of dense retrieval systems. Documents and queries are encoded into dense vectors using models like BERT-based bi-encoders, and retrieval is performed via approximate nearest neighbor search using cosine similarity or dot product similarity. This approach powers modern semantic search systems and retrieval-augmented generation (RAG) pipelines. Vector databases such as Pinecone, Weaviate, and Milvus, along with similarity-search libraries such as FAISS, are specifically designed to store and efficiently search through large collections of embedding vectors. Hybrid retrieval systems combine dense embeddings (for semantic understanding) with sparse representations like BM25 or TF-IDF (for exact keyword matching) to achieve the best of both approaches.
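
For small collections, retrieval can be done exactly by brute force, which is a useful baseline before moving to an approximate index. The NumPy sketch below uses random vectors in place of encoder outputs, and `top_k_cosine` is a hypothetical helper name.

```python
import numpy as np

def top_k_cosine(query, doc_matrix, k=3):
    """Return the indices and cosine similarities of the k most similar documents."""
    q = query / np.linalg.norm(query)
    docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    sims = docs @ q
    order = np.argsort(-sims)[:k]
    return order, sims[order]

rng = np.random.default_rng(0)
doc_matrix = rng.normal(size=(1000, 64))                 # stand-in document embeddings
query = doc_matrix[123] + 0.05 * rng.normal(size=64)     # a query close to document 123

ids, sims = top_k_cosine(query, doc_matrix)
```

Approximate nearest neighbor indexes trade a little of this exactness for much lower search cost on large collections.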

## What is weight tying (shared embeddings)?

In [language models](/wiki/language_model), the input embedding layer and the output projection layer (which predicts the next token) both have shape *(V, d)*. **Weight tying** shares the same weight matrix between these two layers. The input embedding maps tokens to vectors ("what does this token mean when read?"), while the output projection maps hidden states back to vocabulary logits ("what token should be produced?"). Both concern the semantic identity of tokens, making weight sharing a reasonable inductive bias.

Weight tying was introduced by Inan et al. (2016)[8] and Press and Wolf (2017).[7] Press and Wolf studied the topmost weight matrix of a neural language model, showed that it "constitutes a valid word embedding," and recommended "tying the input embedding and this output embedding," reporting that doing so "can reduce the size of neural translation models to less than half of their original size without harming their performance."[7] The technique became standard practice in GPT-2[13] and many subsequent [transformers](/wiki/transformer). The benefits include:

- **Reduced parameter count.** For a 50,000-token vocabulary with 768-dimensional embeddings, the embedding matrix has approximately 38 million parameters. Sharing this matrix between input and output roughly halves the embedding-related parameter count.
- **Improved generalization.** Tying forces a single, consistent representation for each token, which has been shown to lower perplexity in language modeling and improve quality in machine translation.[7]
- **Faster training.** Fewer parameters mean fewer gradient computations and lower memory usage.
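
The sharing itself amounts to using one matrix for both the lookup and the output projection. The NumPy sketch below uses small assumed sizes for the arrays and a `tanh` as a stand-in for the model body; the parameter counts are computed for the 50,000 by 768 example above.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 64                        # small sizes so the example runs quickly
E = rng.normal(0, 0.02, (V, d))        # the single shared matrix

token_ids = np.array([5, 42, 7])
x = E[token_ids]                       # input side: row lookup
hidden = np.tanh(x)                    # stand-in for the transformer stack
logits = hidden @ E.T                  # output side: same matrix, transposed -> (3, V)

# Embedding-related parameter counts for a 50,000-token, 768-dimensional model
untied_params = 2 * 50_000 * 768       # separate input and output matrices
tied_params = 50_000 * 768             # one shared matrix
```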

Recent research (2025) has identified a trade-off: tied embeddings can become biased toward the output (prediction) task because output gradients tend to dominate during early training, potentially reducing the effectiveness of input representations in early layers. Press and Wolf (2017) also showed that the tied embedding evolves in a manner more similar to the output embedding than to the input embedding in an untied model.[7] As a result, some newer large language models (such as Llama 2 and Llama 3) have moved away from weight tying, using separate input and output embedding matrices despite the increased parameter count.

## Padding and special tokens

Embedding layers often need to handle padding tokens and other special tokens that carry no semantic meaning but serve structural purposes in batched computation.

### Padding index

In PyTorch, the `padding_idx` parameter designates a specific index (typically 0) whose embedding vector is initialized to all zeros and does not receive gradient updates during training.[14] This is useful when processing variable-length sequences that are padded to a uniform length within a batch. The padding vector therefore stays at zero throughout training unless it is overwritten manually. A zero vector alone does not remove padding positions from downstream computations: attention still needs a padding mask, and mean pooling should divide by the number of real tokens rather than the padded length.

In Keras, the equivalent `mask_zero=True` parameter tells subsequent layers (particularly [recurrent](/wiki/recurrent_neural_network) layers) to ignore positions where the input is 0. When this flag is set, the vocabulary's `input_dim` must be incremented by 1 to account for the reserved zero index.[15]
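
The effect of padding on pooling can be seen in a short NumPy sketch. It assumes padding ID 0 with a zero embedding row and compares a naive mean over the padded length with a masked mean over the real tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, pad_id = 100, 8, 0
W = rng.normal(size=(V, d))
W[pad_id] = 0.0                                   # padding row held at zero

batch = np.array([[5, 9, 2, 0, 0],
                  [7, 3, 0, 0, 0]])               # sequences padded with ID 0
vectors = W[batch]                                # shape (2, 5, d)
mask = (batch != pad_id)[:, :, None]              # shape (2, 5, 1), True for real tokens

naive_mean = vectors.mean(axis=1)                 # divides by the padded length 5
masked_mean = (vectors * mask).sum(axis=1) / mask.sum(axis=1)   # divides by real length
```

The zero row keeps padding out of the sum, but only the mask corrects the divisor.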

### Out-of-vocabulary tokens

When a word or item is not present in the embedding layer's vocabulary, it is classified as out-of-vocabulary (OOV). Common handling strategies include:

- **UNK token:** Assign all OOV words to a single "unknown" token embedding. This is simple but loses all information about the specific unknown word.
- **Subword tokenization:** Methods like byte pair encoding (BPE) and WordPiece decompose words into known subword units, effectively eliminating the OOV problem. This is the approach used by most modern language models.
- **Character n-grams:** FastText computes embeddings for OOV words by summing the embeddings of their character n-grams, producing reasonable vectors even for words never seen during training.[5]
- **Hash embeddings:** Some systems hash OOV tokens into a fixed number of buckets, each with its own embedding. This reduces memory while providing distinct (though potentially colliding) representations.
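
The hashing strategy in the last bullet can be sketched with the standard library's CRC32. The bucket count, the table of random vectors, and the helper names are assumptions; production systems may use other hash functions or several hashes per token.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
num_buckets, d = 1000, 8
bucket_table = rng.normal(0, 0.1, (num_buckets, d))   # one embedding per bucket

def bucket_for(token):
    """Map any string to a stable bucket index (CRC32 chosen for illustration)."""
    return zlib.crc32(token.encode("utf-8")) % num_buckets

def oov_vector(token):
    """Return the embedding of the bucket this token hashes to."""
    return bucket_table[bucket_for(token)]

vec = oov_vector("axolotl")
```

Two different tokens that land in the same bucket share a vector, which is the collision cost mentioned above.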

## Positional embeddings in transformers

[Transformer](/wiki/transformer) models have no built-in notion of token order: self-attention treats its input as an unordered set, unlike [recurrent neural networks](/wiki/recurrent_neural_network), which process tokens sequentially. To encode position information, transformers either add positional encodings to the token embeddings before the attention layers or, as in RoPE, inject position inside the attention computation. Several approaches exist.

### Sinusoidal positional encoding

The original transformer (Vaswani et al., 2017) used fixed sinusoidal functions of different frequencies to encode positions.[9] Each dimension of the positional encoding vector uses a sine or cosine function with a specific frequency, creating a unique pattern for each position. This approach requires no additional learned parameters and can theoretically generalize to sequence lengths not seen during training.
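
The encoding can be generated in a few lines of NumPy. The sketch follows the formulas from Vaswani et al., $$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{\text{model}}})$$ and $$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{\text{model}}})$$, and assumes an even `d_model`; the function name is hypothetical.

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of fixed positional encodings."""
    positions = np.arange(max_len)[:, None]                         # (max_len, 1)
    freqs = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))   # one per sin/cos pair
    angles = positions * freqs                                      # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)      # even dimensions
    pe[:, 1::2] = np.cos(angles)      # odd dimensions
    return pe

pe = sinusoidal_encoding(max_len=128, d_model=64)
```

Because the matrix is computed rather than learned, it can be regenerated for any `max_len` at inference time.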

### Learned positional embeddings

Models like [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers) and GPT-2 use a second embedding layer that maps position indices (0, 1, 2, ..., max_length - 1) to vectors of the same dimension as the token embeddings. These [positional embeddings](/wiki/positional_encoding) are learned during training rather than fixed, and are added element-wise to the token embeddings. The drawback is that learned positional embeddings cannot handle sequences longer than the maximum length seen during training without additional modifications.
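
A learned positional embedding is simply a second lookup table indexed by position. The NumPy sketch below uses assumed sizes and random, untrained tables; the `embed` helper is hypothetical and raises an error once a sequence exceeds the table length, which is the limitation described above.

```python
import numpy as np

rng = np.random.default_rng(0)
V, max_length, d = 1000, 16, 8
token_table = rng.normal(0, 0.02, (V, d))              # token embeddings
position_table = rng.normal(0, 0.02, (max_length, d))  # one row per position index

def embed(token_ids):
    """Add the position vector for index 0, 1, 2, ... to each token vector."""
    token_ids = np.asarray(token_ids)
    if len(token_ids) > max_length:
        raise ValueError("sequence is longer than the position table")
    positions = np.arange(len(token_ids))
    return token_table[token_ids] + position_table[positions]

x = embed([5, 42, 7, 5])   # token 5 appears at positions 0 and 3
```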

### Rotary position embedding (RoPE)

Introduced by Su et al. (2021), RoPE encodes position by rotating query and key vectors in paired dimensions by an angle proportional to their absolute positions.[11] After rotation, the dot product between a query and key naturally encodes only the relative distance between the two tokens, without adding any learnable parameters.[11] RoPE has become the default positional strategy in many modern language models, including Llama 2, Llama 3, Gemma, Mistral, and Qwen.
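
The relative-position property can be verified numerically. The NumPy sketch below rotates consecutive dimension pairs with the base of 10,000 used in the RoPE paper; pairing adjacent dimensions is one layout among several used in practice, and the function name `rope` is an assumption.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate consecutive dimension pairs of x by angles proportional to position."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)     # one rotation frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

score_a = rope(q, 5) @ rope(k, 2)        # positions 5 and 2, offset 3
score_b = rope(q, 105) @ rope(k, 102)    # positions 105 and 102, same offset
```

The two scores agree up to floating-point error because the rotations cancel except for the offset between the positions.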

RoPE is parameter-free and inherently relative, and it scales gracefully from short to long contexts. Extensions like NTK-aware scaling and YaRN allow models to generalize to sequence lengths substantially longer than those seen during training. Recent work (2025) has explored generalizing RoPE to higher-dimensional spaces using Lie algebraic formulations for applications in video, spatial data, and spherical coordinates.

## Implementation

### PyTorch: `torch.nn.Embedding`

PyTorch provides the `nn.Embedding` module with this constructor signature:[14]

```python
torch.nn.Embedding(
    num_embeddings,        # vocabulary size (V)
    embedding_dim,         # embedding vector dimension (d)
    padding_idx=None,      # index whose embedding stays zero
    max_norm=None,         # renormalize embeddings exceeding this norm
    norm_type=2.0,         # norm type for max_norm
    scale_grad_by_freq=False,  # scale gradients by inverse frequency
    sparse=False           # use sparse gradient updates
)
```

Basic usage:

```python
import torch
import torch.nn as nn

# Create embedding layer: 10,000 tokens, 256-dimensional vectors
emb = nn.Embedding(num_embeddings=10000, embedding_dim=256)

# Look up embeddings for a batch of token IDs
input_ids = torch.tensor([5, 42, 7, 103])
vectors = emb(input_ids)  # shape: (4, 256)
```
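The `padding_idx` argument is easiest to understand on a padded batch. The short sketch below uses made-up sizes and token IDs, and assumes index 0 is reserved for padding. It shows that the padding row starts at zero and receives no gradient.

```python
import torch
import torch.nn as nn

# Tiny illustrative table: 10 tokens, 4 dimensions, index 0 reserved for padding
emb = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=0)

# Two sequences padded with 0 to a common length of 4
batch = torch.tensor([[5, 2, 0, 0],
                      [7, 1, 3, 0]])
vectors = emb(batch)       # shape: (2, 4, 4) = (batch, seq_len, embedding_dim)

vectors.sum().backward()   # dummy loss, only used to inspect gradients
print(emb.weight[0])       # the padding row is all zeros
print(emb.weight.grad[0])  # and its gradient is all zeros, so it is never updated
```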

Loading pre-trained weights:

```python
# Assume pretrained_weights is a FloatTensor of shape (V, d)
emb = nn.Embedding.from_pretrained(pretrained_weights, freeze=True)
```

Setting `freeze=True` (the default for `from_pretrained`) makes the weights non-trainable, which is equivalent to `emb.weight.requires_grad = False`.
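A small sketch of the difference between frozen and trainable pre-trained embeddings follows. The random `pretrained_weights` tensor stands in for real vectors, and the sizes and indices are arbitrary.

```python
import torch
import torch.nn as nn

# Placeholder for real pre-trained vectors: 100 tokens, 16 dimensions
pretrained_weights = torch.randn(100, 16)

frozen = nn.Embedding.from_pretrained(pretrained_weights, freeze=True)
# clone() so the trainable layer does not share storage with the frozen one
tunable = nn.Embedding.from_pretrained(pretrained_weights.clone(), freeze=False)

print(frozen.weight.requires_grad)   # False
print(tunable.weight.requires_grad)  # True

# Only the rows that were actually looked up receive a non-zero gradient
ids = torch.tensor([3, 7])
tunable(ids).sum().backward()
nonzero_rows = tunable.weight.grad.abs().sum(dim=1).nonzero().flatten()
print(nonzero_rows.tolist())         # [3, 7]
```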

### TensorFlow / Keras: `tf.keras.layers.Embedding`

In TensorFlow, the equivalent layer is:[15]

```python
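tf.keras.layers.Embedding(
    input_dim,                         # vocabulary size (V)
    output_dim,                        # embedding dimension (d)
    embeddings_initializer='uniform',  # initializer for the embedding matrix
    embeddings_regularizer=None,       # optional regularizer on the matrix
    embeddings_constraint=None,        # optional constraint (e.g. max-norm)
    mask_zero=False,                   # treat index 0 as padding and emit a mask
    input_length=None                  # legacy argument, deprecated in Keras 3
)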
```

Basic usage:

```python
import tensorflow as tf

# Create embedding layer
emb = tf.keras.layers.Embedding(input_dim=10000, output_dim=256)

# Look up embeddings
input_ids = tf.constant([5, 42, 7, 103])
vectors = emb(input_ids)  # shape: (4, 256)
```

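To freeze the layer, set `emb.trainable = False`. To load pre-trained weights, call `emb.set_weights([pretrained_matrix])` once the layer has been built (for example, after it has been called on a batch), or pass a custom initializer such as `tf.keras.initializers.Constant(pretrained_matrix)` as `embeddings_initializer`.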

### Framework comparison

| Feature | PyTorch (`nn.Embedding`) | TensorFlow (`tf.keras.layers.Embedding`) |
|---|---|---|
| Padding support | `padding_idx` parameter | `mask_zero=True` |
| Norm constraint | `max_norm` parameter | `embeddings_constraint` |
| Sparse gradients | `sparse=True` | No dedicated flag (lookup gradients are typically returned as `tf.IndexedSlices`) |
| Pre-trained loading | `from_pretrained()` class method | `set_weights()` or custom initializer |
| Freeze weights | `freeze` argument of `from_pretrained()` or `weight.requires_grad = False` | `trainable = False` |

## How are embedding weights initialized?

How the embedding weight matrix is initialized before training affects convergence speed and final model performance.

**Random initialization** is the default in most frameworks. PyTorch initializes `nn.Embedding` weights from a standard normal distribution $$N(0, 1)$$.[14] Keras uses a uniform distribution by default (the `'uniform'` initializer, which samples from $$[-0.05, 0.05]$$).[15] Random initialization works well for most tasks when training from scratch with sufficient data.

**Pre-trained initialization** loads vectors from Word2Vec, GloVe, FastText, or another source. For tokens not covered in the pre-trained set, the corresponding rows are typically initialized randomly with variance matching the pre-trained vectors to avoid scale mismatches. Research on transformer models has shown that standardizing pre-trained embeddings to a range consistent with Xavier initialization can improve downstream task performance.

**Xavier (Glorot) initialization** samples weights to preserve activation variance across layers. Originally derived for layers with saturating activations such as tanh, which is symmetric around zero, it is sometimes applied to embedding layers in [transformer](/wiki/transformer) architectures.

**Scaled initialization** appears in some [large language models](/wiki/large_language_model). GPT-2 initializes embeddings from a normal distribution with standard deviation 0.02.[13] Some models scale by $$1/\sqrt{d}$$, where $$d$$ is the embedding dimension. The Hugging Face Transformers library typically uses an `initializer_range` configuration parameter (often defaulting to 0.02) for this purpose. These choices help stabilize training in very deep networks by keeping initial activations in a reasonable range.
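As a hedged illustration of these options in PyTorch, the sketch below overrides the default initialization with `nn.init.normal_`. The sizes are arbitrary, and reading the $$1/\sqrt{d}$$ rule as the standard deviation of the normal distribution is an assumption, since models differ in exactly where they apply that factor.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, d_model = 5000, 64  # illustrative sizes

# PyTorch default: standard normal, so the empirical std is close to 1.0
emb = nn.Embedding(vocab_size, d_model)
default_std = emb.weight.std().item()

# GPT-2 style: re-initialize in place with std = 0.02
nn.init.normal_(emb.weight, mean=0.0, std=0.02)
scaled_std = emb.weight.std().item()

# Alternative (assumed interpretation): std = 1 / sqrt(d)
emb_alt = nn.Embedding(vocab_size, d_model)
nn.init.normal_(emb_alt.weight, mean=0.0, std=d_model ** -0.5)
alt_std = emb_alt.weight.std().item()

print(f"default={default_std:.3f} scaled={scaled_std:.3f} alt={alt_std:.3f}")
```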

In practice, initialization matters most when training data is limited. With large datasets and many training epochs, models tend to learn effective embeddings regardless of starting points. But for small datasets or scenarios using pre-trained embeddings, careful initialization combined with frozen or slowly [fine-tuned](/wiki/fine_tuning) embedding layers can significantly impact performance.
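One possible way to combine pre-trained vectors with variance-matched random rows for missing tokens is sketched below. The `pretrained` dictionary, the five-token vocabulary, and the learning rate are all hypothetical stand-ins, and real pipelines would load vectors from a Word2Vec, GloVe, or FastText file instead.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 8

# Hypothetical stand-in for a real pre-trained vector file
pretrained = {tok: torch.randn(d) * 0.3 for tok in ["cat", "dog", "car"]}
vocab = ["<pad>", "cat", "dog", "bird", "car"]  # "<pad>" and "bird" have no vector

# Draw random rows with the same standard deviation as the pre-trained vectors
std = torch.stack(list(pretrained.values())).std()
weights = torch.randn(len(vocab), d) * std

# Overwrite the rows for tokens that do have a pre-trained vector
for i, tok in enumerate(vocab):
    if tok in pretrained:
        weights[i] = pretrained[tok]

# Keep the layer trainable but fine-tune it slowly with a small learning rate
emb = nn.Embedding.from_pretrained(weights, freeze=False)
optimizer = torch.optim.Adam([{"params": emb.parameters(), "lr": 1e-4}])
```

Passing `freeze=True` instead would keep the vectors fixed, which can be the safer choice when the training set is very small.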

## Explain like I'm 5 (ELI5)

Imagine you have a big book of stickers, where each page has a number. When someone tells you a number, you open to that page and pull out the sticker. Each sticker has special colors and shapes that tell you something about what that number represents.

An embedding layer works in a similar way. It has a big chart (called the weight matrix), and each row in that chart is a list of numbers (a vector). When the computer receives a word or item represented by a number, the embedding layer looks up that row in the chart and hands back the list of numbers. Those numbers carry information about what the word means and how it relates to other words.

The interesting part is that the computer learns what numbers to put in each row by practicing on lots of examples. Over time, words that mean similar things end up with similar rows in the chart. So "happy" and "joyful" get stickers that look almost the same, while "happy" and "refrigerator" get very different stickers.

## References

1. Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). "A Neural Probabilistic Language Model." *Journal of Machine Learning Research, 3*, 1137-1155. https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

2. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). "Efficient Estimation of Word Representations in Vector Space." *arXiv:1301.3781*. https://arxiv.org/abs/1301.3781

3. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). "Distributed Representations of Words and Phrases and their Compositionality." *Advances in Neural Information Processing Systems 26*. https://proceedings.neurips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf

4. Pennington, J., Socher, R., & Manning, C. D. (2014). "GloVe: Global Vectors for Word Representation." *Proceedings of EMNLP*. https://nlp.stanford.edu/pubs/glove.pdf

5. Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). "Enriching Word Vectors with Subword Information." *Transactions of the ACL, 5*, 135-146. https://arxiv.org/abs/1607.04606

6. Guo, C. & Berkhahn, F. (2016). "Entity Embeddings of Categorical Variables." *arXiv:1604.06737*. https://arxiv.org/abs/1604.06737

7. Press, O. & Wolf, L. (2017). "Using the Output Embedding to Improve Language Models." *Proceedings of EACL*. https://arxiv.org/abs/1608.05859

8. Inan, H., Khosravi, K., & Socher, R. (2016). "Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling." *arXiv:1611.01462*. https://arxiv.org/abs/1611.01462

9. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). "Attention Is All You Need." *Advances in Neural Information Processing Systems 30*. https://arxiv.org/abs/1706.03762

10. Ethayarajh, K. (2019). "How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings." *Proceedings of EMNLP-IJCNLP*. https://arxiv.org/abs/1909.00512

11. Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." *arXiv:2104.09864*. https://arxiv.org/abs/2104.09864

12. Koren, Y., Bell, R., & Volinsky, C. (2009). "Matrix Factorization Techniques for Recommender Systems." *IEEE Computer, 42(8)*, 30-37. https://ieeexplore.ieee.org/document/5197422

13. Radford, A., Wu, J., Child, R., et al. (2019). "Language Models are Unsupervised Multitask Learners." *OpenAI Technical Report*. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

14. PyTorch Documentation. "torch.nn.Embedding." https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html

15. TensorFlow Documentation. "tf.keras.layers.Embedding." https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding

16. Brown, T., Mann, B., Ryder, N., et al. (2020). "Language Models are Few-Shot Learners." *Advances in Neural Information Processing Systems 33*. (GPT-3: d_model = 12,288, vocabulary = 50,257.) https://arxiv.org/abs/2005.14165

17. Naumov, M., Mudigere, D., Shi, H.-J. M., et al. (2019). "Deep Learning Recommendation Model for Personalization and Recommendation Systems." *arXiv:1906.00091*. https://arxiv.org/abs/1906.00091