See also: Decoder, Transformer, Autoencoder
An encoder in machine learning is a component that transforms input data into a compressed, structured, or otherwise more useful representation, often called a latent representation, context vector, or embedding vector. Encoders are foundational building blocks across a wide range of architectures, including autoencoders, sequence-to-sequence models, and transformer-based systems. By learning to capture the most salient features of the input, encoders enable downstream tasks such as classification, generation, translation, retrieval, clustering, and anomaly detection.
The concept of encoding traces back to early work on dimensionality reduction and representation learning, but encoders became especially prominent with the rise of deep learning in the 2000s and 2010s. Today, encoder architectures underpin some of the most influential models in natural language processing, computer vision, speech, and multimodal AI. They power semantic search engines, retrieval systems, image classifiers, sentence similarity tools, and the perception components of multimodal models like CLIP.
This article covers the broad family of encoders: how they work, the major taxonomies, the leading model families (BERT, ViT, CLIP, MAE, DINOv2, wav2vec 2.0, Whisper, CodeBERT), the training objectives that shape them, and how encoders compare with decoders and encoder-decoder hybrids in modern AI systems.
At a high level, an encoder takes an input (such as a sentence, image, or audio clip) and maps it through a series of learned transformations into a fixed-size or variable-length representation. This representation is designed to preserve the information most relevant to the task at hand while discarding noise or redundancy.
For a simple single-layer encoder, the mathematical formulation is:
E(x) = σ(Wx + b)
where x is the input, W is a weight matrix, b is a bias vector, and σ is a nonlinear activation function. In practice, modern encoders consist of many such layers stacked together, forming deep neural network architectures capable of learning rich, hierarchical features.
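For concreteness, here is a minimal PyTorch sketch of this single-layer formulation; the dimensions (784 inputs, 64 latent units) are arbitrary illustrative choices, not taken from any particular model.

```python
import torch
import torch.nn as nn

# Minimal single-layer encoder: E(x) = sigma(Wx + b).
class SingleLayerEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=64):
        super().__init__()
        self.linear = nn.Linear(input_dim, latent_dim)  # W and b
        self.activation = nn.ReLU()                     # sigma

    def forward(self, x):
        return self.activation(self.linear(x))

encoder = SingleLayerEncoder()
x = torch.randn(32, 784)   # batch of 32 flattened inputs
z = encoder(x)             # latent representations
print(z.shape)             # torch.Size([32, 64])
```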
The output of the encoder, often called the latent representation or hidden state, serves as the input to a downstream component such as a decoder, a classifier head, or a retrieval index. In a transformer encoder, the output is a sequence of contextualized vectors, one per input token, where each vector is influenced by every other token in the input through self-attention. In a convolutional image encoder, the output is typically a stack of feature maps that capture spatial structure at multiple scales.
Encoders are usually trained with a loss function that aligns their representation with some objective: reconstruction error for autoencoders, masked-token prediction for BERT-style models, contrastive alignment for CLIP and Sentence-BERT, or supervised classification loss for a labeled task. The choice of training objective is what gives a particular encoder its character, far more than the specific architecture.
Encoders span a surprisingly diverse design space. The same name covers everything from a simple affine projection that maps integer IDs to vectors to a 22-billion-parameter vision transformer trained on a billion images. The table below organizes the major families along several useful axes.
| Axis | Endpoints | Examples |
|---|---|---|
| Lossless vs lossy | Lossless: every input bit is preserved. Lossy: information is discarded by design | Lossless: positional encoding, one-hot encoding. Lossy: autoencoders, embeddings, MAE |
| Linear vs nonlinear | Linear: the encoder is a single matrix multiplication. Nonlinear: stacked nonlinear layers | Linear: PCA, classical word embeddings (lookup tables). Nonlinear: BERT, ViT, autoencoders |
| Symmetric vs asymmetric | Symmetric: encoder and decoder mirror each other. Asymmetric: one side is much smaller or absent | Symmetric: classic autoencoder, U-Net. Asymmetric: MAE (heavy encoder, light decoder), BERT (no decoder) |
| Deterministic vs probabilistic | Deterministic: each input maps to a single vector. Probabilistic: each input maps to a distribution | Deterministic: BERT, ViT, ResNet. Probabilistic: variational autoencoder |
| Single-modal vs multimodal | Trained on one modality vs jointly with another | Single: BERT (text), ViT (image). Multimodal: CLIP, SigLIP, ALIGN |
| Causal vs bidirectional attention | Each token can attend only to past tokens vs all tokens | Causal: GPT (this is technically a decoder). Bidirectional: BERT, ViT, MAE |
Lossless encoders such as positional encoding sit at one extreme: they do not throw any information away, they simply repackage it (a token index becomes a sinusoidal vector that carries the same content in a form attention layers can use). At the other extreme, an autoencoder bottleneck is deliberately tiny so that reconstruction forces the network to keep only what matters. Most production encoders live somewhere in the middle.
An autoencoder is an unsupervised neural network architecture consisting of two halves: an encoder and a decoder. The encoder compresses the input data from a high-dimensional space (X = R^m) into a lower-dimensional latent space (Z = R^n, where m > n). The decoder then attempts to reconstruct the original input from this compressed representation. Because the latent space is smaller than the input space, the encoder is forced to learn only the most important features, a process sometimes described as creating a "bottleneck."
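A minimal bottleneck autoencoder might look like the following PyTorch sketch; the layer widths are illustrative, and real architectures vary widely.

```python
import torch
import torch.nn as nn

# Illustrative autoencoder with a bottleneck: 784 -> 32 -> 784.
class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),        # the bottleneck
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)        # compress
        return self.decoder(z)     # reconstruct

model = Autoencoder()
x = torch.randn(16, 784)
loss = nn.functional.mse_loss(model(x), x)   # reconstruction error
loss.backward()
```

Because the 32-dimensional code cannot memorize a 784-dimensional input, minimizing the reconstruction loss forces the encoder to keep only the features that matter most.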
The modern resurgence of autoencoders in deep learning is usually traced to Hinton and Salakhutdinov's 2006 Science paper, Reducing the Dimensionality of Data with Neural Networks. They showed that a deep autoencoder, pre-trained as a stack of restricted Boltzmann machines and then fine-tuned with backpropagation, could produce far better low-dimensional codes than principal component analysis (PCA). That paper is widely considered one of the catalysts for the broader deep learning revival of the late 2000s.
Autoencoders serve multiple purposes, including dimensionality reduction, feature extraction, data compression, and denoising. Several important variants exist:
| Variant | Description |
|---|---|
| Standard autoencoder | Learns a deterministic mapping from input to a compressed latent code and back |
| Denoising autoencoder (DAE) | Trained on corrupted inputs (Vincent et al. 2008), learning to reconstruct the original clean data, which improves robustness |
| Sparse autoencoder (SAE) | Enforces sparsity in the latent code so that most entries are near zero, encouraging the model to discover compact features |
| Contractive autoencoder (CAE) | Adds a regularization penalty based on the Jacobian of the encoder, encouraging the learned representation to be insensitive to small input perturbations |
| Variational autoencoder (VAE) | Maps inputs to probability distribution parameters rather than fixed vectors (Kingma and Welling 2013), enabling generative modeling |
| Masked autoencoder (MAE) | Masks a high fraction of input patches and reconstructs them; proposed for vision by He et al. (2022) |
Denoising autoencoders, introduced by Pascal Vincent and colleagues at ICML 2008, train the encoder to map a corrupted input back to the clean original. The corruption can be Gaussian noise, dropout-style erasure, or salt-and-pepper masking. By making the encoder produce useful representations even when its input is partly destroyed, denoising training tends to discover more robust features than vanilla reconstruction. Stacked denoising autoencoders (Vincent et al. 2010) extend this to deep networks by training one denoising layer at a time, an approach that briefly competed with deep belief networks before backpropagation through deep architectures became routine.
The denoising principle quietly underlies a lot of modern self-supervised learning. BERT's masked language modeling is a denoising objective at the token level. Diffusion models train a network to denoise images at every noise level. The masked autoencoder for vision is a direct intellectual descendant.
Variational autoencoders (VAEs), introduced by Kingma and Welling in 2013, extend the autoencoder concept by introducing a probabilistic latent space. Unlike a standard autoencoder, where the encoder outputs a single fixed point in latent space, the VAE encoder outputs the parameters of a probability distribution, typically the mean (μ) and log-variance (log σ²) of a Gaussian distribution for each latent dimension.
To generate a latent sample, the model uses the reparameterization trick: z = μ + σ ⊙ ε, where ε is sampled from a standard normal distribution N(0, 1). This trick allows gradients to flow back through the sampling operation, making end-to-end training possible via backpropagation.
The VAE encoder is trained with a loss function that combines reconstruction error with a Kullback-Leibler (KL) divergence term, which encourages the learned latent distribution to remain close to a standard normal prior. This regularization produces a smooth, continuous latent space that is well-suited for generating new data samples and for interpolation between examples.
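The encoder head, the reparameterization trick, and the closed-form Gaussian KL term all fit in a short sketch; class and variable names here are illustrative.

```python
import torch
import torch.nn as nn

# Sketch of a VAE encoder head: it outputs mu and log-variance,
# then samples z with the reparameterization trick.
class VAEEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=20):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, 400), nn.ReLU())
        self.mu_head = nn.Linear(400, latent_dim)
        self.logvar_head = nn.Linear(400, latent_dim)

    def forward(self, x):
        h = self.hidden(x)
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        eps = torch.randn_like(mu)               # eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * eps   # z = mu + sigma * eps
        return z, mu, logvar

# Closed-form KL divergence from N(mu, sigma^2) to the N(0, I) prior:
def kl_to_standard_normal(mu, logvar):
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
```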
The masked autoencoder (MAE), proposed by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick in 2022, applies the masked-reconstruction idea to images. MAE divides an image into patches, hides a large fraction of them (typically 75%), and trains an asymmetric encoder-decoder to reconstruct the missing pixels. The encoder is a standard ViT that processes only the visible patches, ignoring the masked ones entirely. A small decoder then receives the encoded visible patches plus learnable mask tokens and reconstructs the full image.
The asymmetry is what makes MAE practical at scale. Because the encoder never sees mask tokens during pre-training, the bulk of the compute is spent on a small fraction of the input patches. This makes training roughly three times faster than methods that feed mask tokens to the encoder, while improving accuracy. A vanilla ViT-Huge trained with MAE on ImageNet-1K reached 87.8% top-1 accuracy using only that dataset's labels for fine-tuning, the best result among methods that did not pull in extra labeled data.
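The random-masking step at the heart of MAE is simple to sketch. This illustrative function keeps a random 25% of patch tokens (the paper's 75% mask ratio) and returns only those for the encoder; the shapes assume ViT-Base-style 16x16 patches on a 224x224 image.

```python
import torch

# Illustrative MAE-style random masking over a sequence of patch tokens.
def random_masking(patches, mask_ratio=0.75):
    batch, num_patches, dim = patches.shape
    num_keep = int(num_patches * (1 - mask_ratio))
    noise = torch.rand(batch, num_patches)           # score each patch
    keep_idx = noise.argsort(dim=1)[:, :num_keep]    # lowest-noise patches survive
    visible = torch.gather(
        patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, dim))
    return visible, keep_idx                         # encoder sees only `visible`

patches = torch.randn(8, 196, 768)   # 14x14 grid of patches, ViT-Base width
visible, keep_idx = random_masking(patches)
print(visible.shape)                 # torch.Size([8, 49, 768])
```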
The encoder-decoder framework became a standard paradigm in natural language processing through seq2seq architectures developed in the mid-2010s. Two papers in 2014 effectively launched the modern era. Cho, van Merriënboer, and colleagues introduced the encoder-decoder framework using recurrent networks with a new gated recurrent unit (GRU) for statistical machine translation. Sutskever, Vinyals, and Le at Google followed with Sequence to Sequence Learning with Neural Networks, which used a multi-layer LSTM as both encoder and decoder. Their model reached a BLEU score of 34.8 on the WMT'14 English-to-French test set, beating a strong phrase-based statistical system.
In a sequence-to-sequence model, the encoder processes an input sequence (such as a sentence in a source language) token by token using a recurrent neural network (RNN), long short-term memory (LSTM) network, or gated recurrent unit (GRU). After processing the entire input, the encoder produces a final hidden state called the context vector, which summarizes the meaning of the input sequence.
The decoder then uses this context vector to generate the output sequence one token at a time. This architecture was successfully applied to machine translation, text summarization, and dialogue systems, and remained the dominant paradigm in NLP from 2014 until transformers took over starting in 2017.
A fundamental limitation of early seq2seq models was the information bottleneck. Because the encoder compressed the entire input into a single fixed-size context vector, long or complex input sequences would lose information during encoding. The context vector had to contain a complete summary of the input, regardless of its length, which led to degraded performance on longer sequences.
This bottleneck was addressed by the attention mechanism, proposed by Bahdanau, Cho, and Bengio in 2014. Rather than relying solely on the final encoder hidden state, attention allows the decoder to look at all encoder hidden states at each decoding step. The decoder learns to assign different weights to different encoder positions, effectively letting it focus on the most relevant parts of the input for each output token. This eliminated the need to compress all information into a single vector and dramatically improved performance on tasks involving long sequences. Attention turned out to be the conceptual key that unlocked the transformer three years later.
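A minimal sketch of additive (Bahdanau-style) attention conveys the idea: the decoder state scores every encoder hidden state, and the context vector is the softmax-weighted average. Dimensions and layer names are illustrative.

```python
import torch
import torch.nn as nn

# Additive attention sketch: score each encoder state against the
# current decoder state, then average encoder states by those weights.
class AdditiveAttention(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.W_enc = nn.Linear(hidden, hidden, bias=False)
        self.W_dec = nn.Linear(hidden, hidden, bias=False)
        self.v = nn.Linear(hidden, 1, bias=False)

    def forward(self, dec_state, enc_states):
        # dec_state: (B, H); enc_states: (B, T, H)
        scores = self.v(torch.tanh(
            self.W_enc(enc_states) + self.W_dec(dec_state).unsqueeze(1)))
        weights = scores.softmax(dim=1)            # (B, T, 1)
        return (weights * enc_states).sum(dim=1)   # context vector, (B, H)
```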
The transformer architecture, introduced by Vaswani et al. in the landmark 2017 paper Attention Is All You Need, replaced recurrence entirely with self-attention. The original transformer used an encoder-decoder structure for machine translation, and its encoder half has since been the basis for an enormous family of encoder-only models.
The encoder in the original transformer consists of a stack of 6 identical layers. Each layer contains two sub-layers: a multi-head self-attention mechanism and a position-wise feed-forward network.
Each sub-layer is wrapped with a residual connection and layer normalization. The entire encoder processes all input tokens in parallel (unlike RNNs, which process tokens sequentially), making transformers far more efficient to train on modern hardware.
A critical property of the transformer encoder is that it uses bidirectional attention: each token can attend to all other tokens in the input, both preceding and following. This contrasts with the decoder, which uses causal (masked) attention to prevent tokens from attending to future positions. Bidirectionality is what makes transformer encoders so good at understanding tasks; it is also why they cannot be used directly for autoregressive text generation.
Because self-attention is permutation-invariant, the transformer encoder needs an explicit signal of token order. The original paper added sinusoidal positional encoding vectors to the input embeddings; later models have used learned positional embeddings, relative position bias, and rotary position embeddings (RoPE), among others.
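The original sinusoidal scheme can be reproduced in a few lines. This sketch follows the paper's formulas PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).

```python
import torch

# Sinusoidal positional encoding as in the original transformer.
def sinusoidal_positions(seq_len, d_model):
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    angles = pos / torch.pow(10000.0, i / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)   # even dimensions
    pe[:, 1::2] = torch.cos(angles)   # odd dimensions
    return pe   # added elementwise to the token embeddings

pe = sinusoidal_positions(seq_len=128, d_model=512)
print(pe.shape)   # torch.Size([128, 512])
```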
Encoder-only architectures use just the encoder stack of the transformer, without a decoder. These models process the full input with bidirectional self-attention and produce contextualized representations of each token. They are designed for natural language understanding (NLU) tasks rather than text generation.
The most influential encoder-only model is BERT (Bidirectional Encoder Representations from Transformers), published by Devlin et al. in 2018. BERT is pre-trained on two objectives: masked language modeling (MLM), where 15% of input tokens are randomly masked and the model predicts them, and next sentence prediction (NSP), where the model predicts whether two sentences are consecutive. Of the 15% masked positions, the original recipe replaces 80% with the special [MASK] token, 10% with a random token, and 10% with the unchanged original token. This trick was meant to soften the pretrain-finetune mismatch caused by the fact that downstream inputs never contain [MASK].
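The 80/10/10 recipe is straightforward to sketch over raw token IDs. This illustrative version omits special-token handling and whole-word masking; the mask ID and vocabulary size match the standard BERT-base uncased vocabulary.

```python
import random

MASK_ID = 103        # [MASK] in the standard BERT vocabulary
VOCAB_SIZE = 30522

def mlm_mask(token_ids, mask_prob=0.15):
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = ignore in loss
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                  # model must predict the original token
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID          # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)  # 10%: random token
            # remaining 10%: keep the original token unchanged
    return inputs, labels
```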
BERT showed that a single pre-trained encoder, fine-tuned with a small head, could set state-of-the-art results across an entire benchmark suite (GLUE, SQuAD, SWAG) at once. The paper is one of the most cited in modern NLP, and the BERT-style training recipe spawned a long line of successors.
| Model | Year | Layers | Hidden size | Parameters | Key innovation |
|---|---|---|---|---|---|
| BERT-Base | 2018 | 12 | 768 | 110M | Bidirectional MLM + NSP pre-training |
| BERT-Large | 2018 | 24 | 1024 | 340M | Scaled-up version of BERT-Base |
| RoBERTa | 2019 | 24 | 1024 | 355M | Removed NSP, dynamic masking, larger batches, more data |
| ALBERT | 2019 | up to 24 | 768-4096 | 12M-235M | Cross-layer parameter sharing, factorized embeddings |
| DistilBERT | 2019 | 6 | 768 | 66M | Knowledge distillation; 60% smaller, 97% of performance |
| ELECTRA | 2020 | 12 | 768 | 110M | Replaced-token detection instead of MLM |
| DeBERTa / v3 | 2020-2021 | 24-48 | 1024-1536 | 304M-1.5B | Disentangled attention; 1.5B variant first to beat the human SuperGLUE baseline |
| XLNet | 2019 | 24 | 1024 | 340M | Permutation language modeling, autoregressive variant |
RoBERTa (Liu et al. 2019) is essentially BERT with the training cranked harder: it removed the next-sentence-prediction task, used dynamic masking that changes from epoch to epoch, trained on roughly ten times more data, and used larger batches. It outperformed the original BERT on most benchmarks without changing the architecture at all, becoming a strong reminder that pre-training recipes matter as much as model design.
ELECTRA (Clark, Luong, Le, and Manning 2020) replaces the masked-language-modeling objective with replaced-token detection. A small generator network corrupts a fraction of input tokens by sampling plausible replacements, and the main encoder learns to classify each token as original or replaced. Because the loss is computed over every token (not just the 15% that BERT masks), ELECTRA learns far more sample-efficiently. A small ELECTRA model trained on a single GPU for four days outperformed the original GPT, which used 30 times more compute, on the GLUE benchmark.
DeBERTa (Pengcheng He et al. 2020) introduces disentangled attention, which represents each word with two separate vectors (content and position) and computes attention scores using disentangled matrices over content and relative position. Scaled to 1.5 billion parameters, the original DeBERTa was the first model to surpass the human baseline on the SuperGLUE benchmark; DeBERTa-v3 later combined disentangled attention with ELECTRA-style replaced-token detection for further gains.
Encoder-only models are widely used for text classification, sentiment analysis, named entity recognition (NER), question answering, semantic role labeling, and generating dense text embeddings for retrieval systems. Despite the dominance of decoder-only LLMs in headline benchmarks, encoder models remain the workhorse of production NLP because they are much smaller, faster, and easier to fine-tune for a specific task.
Encoder-decoder transformer models retain both the encoder and decoder stacks. The encoder processes the input with bidirectional attention, while the decoder generates output autoregressively using causal attention and cross-attention to the encoder's output. This architecture is natural for tasks where the input and output are both sequences but their relationship is not strictly aligned, such as translation, summarization, and structured generation.
Prominent encoder-decoder models include:
| Model | Developer | Year | Parameters | Description |
|---|---|---|---|---|
| Original Transformer | Google (Vaswani et al.) | 2017 | 65M-213M | 6 encoder + 6 decoder layers; introduced self-attention for seq2seq |
| T5 | Google Research (Raffel et al.) | 2019 | 60M to 11B | Text-to-Text Transfer Transformer; frames all NLP tasks as text generation |
| BART | Meta AI (Lewis et al.) | 2019 | ~140M-400M | BERT-style encoder with GPT-style decoder; pre-trained with text corruption |
| mT5 | Google Research | 2020 | up to 13B | Multilingual T5 trained on 101 languages |
| Flan-T5 | Google Research | 2022 | up to 11B | T5 fine-tuned with instruction tuning |
| UL2 | Google Research | 2022 | 20B | Unified pre-training over multiple denoising objectives |
T5 uses a text-to-text framework where every NLP task, including translation, summarization, classification, and question answering, is converted into a text generation problem by prepending a task-specific prefix (for example, "summarize: ..."). BART is pre-trained by corrupting text with token masking, sentence permutation, and text infilling, and training the decoder to reconstruct the original. Both models closely follow the original transformer architecture, with T5 using relative positional embeddings instead of sinusoidal encoding.
Text encoders transform text into dense vector representations (embeddings) that capture semantic meaning. While early approaches like Word2Vec and GloVe generated static, context-independent word embeddings, modern text encoders produce contextual embeddings where the same word can have different representations depending on its surrounding context.
Vanilla BERT is not optimal for sentence-level embeddings because its token-level outputs need to be aggregated (typically through pooling) to produce a single sentence vector. Reimers and Gurevych showed in 2019 that simply mean-pooling the BERT outputs produces sentence embeddings worse than averaging GloVe vectors. Comparing pairs of sentences with vanilla BERT also requires running both sentences through the model jointly, which is hopelessly slow at scale: finding the most similar pair in a collection of 10,000 sentences requires roughly 50 million inference passes (about 65 hours).
Sentence-BERT (SBERT), introduced by Nils Reimers and Iryna Gurevych at EMNLP 2019, addresses this by fine-tuning BERT-style models using Siamese and triplet network structures. Two encoders share weights and produce sentence vectors independently; the loss pulls semantically similar sentence pairs together in vector space and pushes dissimilar pairs apart. SBERT reduced the time to find the most similar pair in 10,000 sentences from 65 hours to roughly 5 seconds, while maintaining the accuracy of joint BERT scoring. The SentenceTransformers library that grew out of this work has become the standard framework for training and serving sentence-encoder models.
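With the SentenceTransformers library, the encode-then-compare workflow looks like the following sketch; the checkpoint name is one of the library's widely used public models.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "A man is eating food.",
    "Someone is having a meal.",
    "The sky is blue.",
]
embeddings = model.encode(sentences)      # one vector per sentence, computed independently

# Cosine similarity between the independently encoded sentences:
scores = util.cos_sim(embeddings, embeddings)
print(scores[0, 1] > scores[0, 2])        # True: the first two are semantically closer
```

Because each sentence is encoded once and compared with cheap vector math, the 10,000-sentence search that takes hours with joint BERT scoring reduces to one encoding pass plus a similarity lookup.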
Google's Universal Sentence Encoder (Cer et al. 2018) preceded SBERT and offered two variants: a large transformer-based model and a smaller deep averaging network (DAN). Both produce 512-dimensional embeddings and are still used in production for short-text similarity tasks, especially when latency matters more than absolute accuracy.
More recent text encoders push the size and quality further. The MTEB (Massive Text Embedding Benchmark) leaderboard tracks dozens of competing models, including E5, BGE, GTE, Nomic Embed, Jina Embeddings, Stella, and various OpenAI and Cohere proprietary embeddings. Most use a BERT-style or DeBERTa-style encoder backbone fine-tuned with contrastive losses on hundreds of millions of weakly labeled pairs.
In computer vision, encoders extract spatial features from images and map them to a representation suitable for downstream tasks such as classification, detection, segmentation, or image retrieval.
Convolutional neural networks (CNNs) have long served as the dominant image encoder architecture. Models such as ResNet (He et al. 2015), VGG (Simonyan and Zisserman 2014), Inception (Szegedy et al. 2015), and EfficientNet (Tan and Le 2019) act as feature extraction backbones, processing images through successive convolutional layers that capture increasingly abstract visual features. The early layers learn edges and textures; deeper layers learn parts and object-like patterns. Features from the last convolutional block, typically globally average-pooled, can be used directly for classification or passed to a decoder for tasks like image segmentation. These backbones still power many production systems, especially on edge devices where transformer attention is too memory-hungry.
The Vision Transformer (ViT), introduced by Dosovitskiy et al. in 2020, adapts the transformer encoder for image understanding. Rather than processing pixel arrays with convolutions, ViT divides an image into fixed-size patches (typically 16x16 pixels), linearly embeds each patch, adds positional embeddings, and feeds the resulting sequence of tokens through a standard transformer encoder. ViT demonstrated that pure transformer encoders can match or surpass CNNs on image classification when trained on large datasets such as JFT-300M. The ViT-Base/16, ViT-Large/16, and ViT-Huge/14 variants have become standard backbones for downstream vision work.
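The patchification step can be sketched with a single strided convolution, which is equivalent to slicing non-overlapping 16x16 patches and applying a shared linear projection; the 224x224 input and ViT-Base width of 768 are standard illustrative choices.

```python
import torch
import torch.nn as nn

# ViT-style patch embedding: a 224x224 image becomes 196 patch tokens.
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)

image = torch.randn(1, 3, 224, 224)
tokens = patch_embed(image)                  # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 196, 768) token sequence
```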
Swin Transformer (Liu et al. 2021) introduces a hierarchical encoder built around shifted local attention windows. Self-attention is computed within non-overlapping local windows; alternating blocks shift the window grid so that information leaks across window boundaries every other layer. This brings the computational pattern closer to a convolutional pyramid and makes the encoder well-suited for dense prediction tasks like detection and segmentation. Swin won the ICCV 2021 Marr Prize for its strong results on ImageNet (87.3% top-1), COCO detection (58.7 box AP, 51.1 mask AP), and ADE20K segmentation (53.5 mIoU).
DINO (Caron et al. 2021) and its successor DINOv2 (Oquab et al. 2024) are self-supervised vision encoders trained without any image labels. DINOv2 was published in Transactions on Machine Learning Research in early 2024 by a team led by Maxime Oquab at Meta AI. The model is pre-trained on a curated dataset of 142 million images using a discriminative self-distillation objective with momentum encoders.
The selling point of DINOv2 is that its frozen features work well across many downstream vision tasks (classification, depth estimation, segmentation, instance retrieval) with only a linear probe on top. No fine-tuning of the backbone is required. The training pipeline is also roughly twice as fast and uses three times less memory than comparable discriminative self-supervised methods. DINOv2 features have become a popular off-the-shelf vision backbone for robotics, medical imaging, and any setting where labeled training data is scarce.
The masked autoencoder (described in the autoencoder section above) doubles as a self-supervised vision encoder. After pre-training, the MAE encoder is detached from the lightweight decoder and used on its own as a feature extractor or fine-tuned for downstream tasks. MAE features are competitive with or stronger than supervised ImageNet pre-training on tasks such as object detection and semantic segmentation.
Multimodal encoders project two or more modalities into a shared vector space, where vectors close together represent semantically related content across modalities.
CLIP (Contrastive Language-Image Pre-training), developed by OpenAI in 2021, is the canonical example. It uses two separate encoders, one for images and one for text, trained jointly with a contrastive loss on 400 million image-text pairs scraped from the web. For each batch of N pairs, the model maximizes the cosine similarity between matching image-text vectors and minimizes it between the N²-N mismatched pairs, using a symmetric cross-entropy loss on the resulting similarity matrix. The image encoder in CLIP is typically a ViT (e.g., ViT-L/14), though ResNet variants exist. The text encoder is a 12-layer GPT-style transformer (used here as an encoder by taking the final-layer activation at the end-of-sequence token as the text representation).
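The symmetric contrastive loss can be sketched directly from that description; the fixed temperature below is an illustrative constant standing in for CLIP's learned temperature parameter.

```python
import torch
import torch.nn.functional as F

# Sketch of CLIP's symmetric contrastive loss over a batch of N pairs.
# Row i of `logits` scores image i against every text; the matching
# text sits on the diagonal, so the targets are simply 0..N-1.
def clip_loss(img_emb, txt_emb, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature
    targets = torch.arange(len(logits))
    loss_i = F.cross_entropy(logits, targets)     # image -> text direction
    loss_t = F.cross_entropy(logits.T, targets)   # text -> image direction
    return (loss_i + loss_t) / 2

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```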
The payoff is that CLIP can perform zero-shot classification on arbitrary image categories without any task-specific fine-tuning. To classify an image, you embed the image and embed the text strings "a photo of a cat," "a photo of a dog," and so on, then pick the text string with the highest cosine similarity to the image. CLIP also became the workhorse text-image alignment module in many later systems, including Stable Diffusion and various open-source vision-language models.
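With the Hugging Face transformers implementation of CLIP, the zero-shot procedure looks like this sketch; `pet.jpg` is a placeholder path, and the candidate labels are illustrative.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

labels = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]
inputs = processor(text=labels, images=Image.open("pet.jpg"),
                   return_tensors="pt", padding=True)

outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)   # one probability per label
print(dict(zip(labels, probs[0].tolist())))
```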
SigLIP (Zhai et al., ICCV 2023) replaces CLIP's softmax-normalized contrastive loss with a simple pairwise sigmoid loss. Each image-text pair is treated as an independent binary classification problem (matched or not), so no global normalization across the batch is required. This change lets training scale to much larger batch sizes, performs better at small batch sizes, and works well even on modest hardware. With only four TPUv4 chips, the SigLIP authors trained a Large LiT model at batch size 20k that reached 84.5% ImageNet zero-shot accuracy in two days.
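An illustrative version of the pairwise sigmoid loss follows; the fixed temperature and bias values stand in for SigLIP's learned parameters.

```python
import torch
import torch.nn.functional as F

# Pairwise sigmoid loss in the spirit of SigLIP: every (image, text)
# cell of the similarity matrix is an independent binary classification
# (matched on the diagonal, unmatched elsewhere), so no batch-wide
# softmax normalization is required.
def sigmoid_contrastive_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T * t + b           # (N, N) similarity matrix
    labels = 2 * torch.eye(len(logits)) - 1        # +1 on diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()

loss = sigmoid_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```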
LiT (Zhai et al. 2022) introduced locked-image tuning: the image encoder is initialized from a strong pre-trained vision model and frozen, while only the text encoder is trained to align with it. This decouples representation learning from alignment and tends to produce better zero-shot transfer than training both encoders from scratch.
Google's ALIGN, Meta's ImageBind, and Apple's MM-CLIP variants extend the same dual-encoder pattern to more modalities and noisier data. ImageBind in particular learns a single embedding space across six modalities (images, text, audio, depth, thermal, IMU) using image as the anchor.
Audio encoders convert raw waveforms or spectrograms into vector representations that downstream models can use for transcription, classification, separation, and speaker identification.
wav2vec 2.0 (Baevski, Zhou, Mohamed, and Auli 2020) is a self-supervised speech encoder built around a convolutional feature encoder followed by a transformer context encoder. It masks portions of the latent speech representation and trains the model to identify the correct quantized representation of the masked region from a set of distractors using a contrastive loss. After pre-training on 53,000 hours of unlabeled audio, fine-tuning on as little as ten minutes of transcribed speech produced word-error rates of 4.8%/8.2% on the LibriSpeech clean/other test sets, the first demonstration that strong speech recognition was possible with very little labeled data.
Whisper (Radford et al., OpenAI 2022) is a transformer encoder-decoder trained on 680,000 hours of weakly supervised multilingual audio. The encoder ingests an 80-channel log-Mel spectrogram derived from 16 kHz audio with 25 ms windows and a 10 ms stride. Through scale and weak supervision rather than self-supervision, Whisper produces transcripts that are competitive with prior fully supervised systems in a zero-shot setting and works across 99 languages.
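A Whisper-style frontend can be approximated with torchaudio as in the sketch below; Whisper's own preprocessing differs in padding and normalization details, and the 30-second dummy waveform is illustrative.

```python
import torch
import torchaudio

# 80-channel log-Mel spectrogram from 16 kHz audio.
# n_fft=400 is a 25 ms window; hop_length=160 is a 10 ms stride.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80)

waveform = torch.randn(1, 16000 * 30)        # 30 seconds of dummy audio
features = torch.log(mel(waveform) + 1e-6)   # (1, 80, ~3000 frames)
```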
HuBERT (Hsu et al. 2021) replaces wav2vec 2.0's contrastive loss with a clustering-based objective that predicts the discrete cluster labels of masked speech segments. USM (Universal Speech Model, Google 2023) and Meta's MMS (Massively Multilingual Speech, 2023) push speech encoders to thousands of languages.
A family of encoder models specializes in source code understanding. CodeBERT (Feng et al. 2020), released by Microsoft, is a BERT-style bimodal encoder trained on natural-language and programming-language pairs from GitHub across six languages (Python, Java, JavaScript, PHP, Ruby, Go). It uses ELECTRA-style replaced-token detection during pre-training and supports tasks like code search, code documentation generation, and code-natural-language alignment.
GraphCodeBERT (Guo et al. 2021) extends CodeBERT by injecting data-flow graphs into pre-training, capturing structural dependencies between variables. UniXcoder (Guo et al. 2022) unifies cross-modal pre-training across code, comments, and abstract syntax trees, and works as both an encoder and a decoder. StarEncoder (BigCode 2023) and CodeT5+ (Wang et al. 2023) provide modern open code encoders backing tools like AI code search, vulnerability detection, and clone detection.
Encoders look superficially similar across modalities; what distinguishes them is the loss they were trained with. The dominant pre-training objectives are:
| Objective | Description | Example models |
|---|---|---|
| Reconstruction | Predict the original input from a compressed or corrupted version | Autoencoder, denoising autoencoder, VAE, MAE |
| Masked language modeling (MLM) | Predict tokens randomly hidden from the input; bidirectional context | BERT, RoBERTa, DeBERTa, mBERT, XLM-R |
| Replaced-token detection | Classify each token as original or replaced by a generator | ELECTRA, DeBERTa-v3, CodeBERT |
| Contrastive | Pull matched pairs together and push unmatched pairs apart in vector space | SimCLR, MoCo, CLIP, SigLIP, Sentence-BERT, wav2vec 2.0 |
| Self-distillation | Match the encoder's output to a momentum-averaged teacher copy of itself | DINO, DINOv2, BYOL |
| Next-sentence prediction (NSP) | Classify whether two sentences are consecutive in the corpus | BERT (later removed by RoBERTa) |
| Permutation language modeling | Predict tokens in a random factorization order | XLNet |
| Span corruption | Mask contiguous spans and predict them | T5, SpanBERT |
Not every objective ages well. Next-sentence prediction was central to BERT but was found to be redundant or even harmful by RoBERTa and ALBERT. Sparse-coding constraints in autoencoders briefly looked like the future in the late 2000s and have largely faded except in interpretability work on language models. Contrastive learning, by contrast, has expanded from images (SimCLR, MoCo) into text (Sentence-BERT, SimCSE), audio (wav2vec 2.0), and multimodal settings (CLIP, SigLIP), becoming arguably the dominant self-supervised paradigm of the 2020s.
One of the most consequential developments in modern deep learning is the use of pre-trained encoders for transfer learning. Rather than training an encoder from scratch for each task, a large encoder model is first pre-trained on a massive unlabeled or weakly labeled dataset using self-supervised objectives. The pre-trained encoder captures general-purpose representations that can then be fine-tuned on smaller, task-specific labeled datasets.
This approach, popularized by ULMFiT and ELMo and then cemented by BERT, has been extended to vision (ViT, MAE, DINOv2, CLIP), audio (wav2vec 2.0, Whisper, HuBERT), and code (CodeBERT, StarEncoder). It has several advantages: it drastically reduces the amount of labeled data needed for each downstream task, it amortizes the large one-time pre-training cost across many applications, and it typically yields better accuracy than training a task-specific model from scratch, especially on small datasets.
Research has shown that fine-tuning primarily affects the top layers of pre-trained encoder models, while lower layers retain general features. Importantly, fine-tuning does not lead to catastrophic forgetting of the linguistic or visual knowledge learned during pre-training in most cases, though it can in low-data regimes.
A defining feature of modern encoder architectures is their ability to produce contextual embeddings. Unlike static word vectors (such as Word2Vec or GloVe), where each word has a single fixed representation regardless of context, encoder-based models like BERT generate representations that depend on the entire input context.
For example, the word "bank" in "river bank" and "bank account" receives different vector representations from a contextual encoder, because the self-attention mechanism allows each token's representation to incorporate information from all surrounding tokens. These contextual embeddings have proven far more effective for tasks requiring nuanced language understanding, including word sense disambiguation, coreference resolution, and natural language inference.
Contextual embeddings from pre-trained encoders can also be extracted and used as features for other models, a technique sometimes called feature-based transfer learning, as opposed to full fine-tuning. The choice between feature extraction and fine-tuning depends on dataset size, compute budget, and how different the downstream task is from pre-training: small datasets close to the pre-training distribution often benefit from feature extraction with a frozen encoder, while large datasets or distant tasks usually justify full fine-tuning.
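A sketch of the feature-based approach with a frozen BERT backbone and a small trainable head; the binary task and the `classify` helper are illustrative.

```python
import torch.nn as nn
from transformers import AutoModel

# Feature-based transfer: freeze the pre-trained encoder and train only
# a small classification head on top of the [CLS] vector.
encoder = AutoModel.from_pretrained("bert-base-uncased")
for p in encoder.parameters():
    p.requires_grad = False                 # frozen backbone

head = nn.Linear(encoder.config.hidden_size, 2)   # e.g. binary sentiment

def classify(batch):
    hidden = encoder(**batch).last_hidden_state   # (B, seq_len, 768)
    cls_vec = hidden[:, 0]                        # [CLS] token representation
    return head(cls_vec)                          # only `head` receives gradients
```

Full fine-tuning is the same setup with the `requires_grad = False` loop removed, so the encoder weights update alongside the head.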
The three transformer architecture families differ in their attention pattern, training objective, and natural use case. Each family is optimized for a different shape of problem.
| Property | Encoder-only | Decoder-only | Encoder-decoder |
|---|---|---|---|
| Attention type | Bidirectional | Causal (masked) | Bidirectional encoder + causal decoder + cross-attention |
| Typical pre-training | MLM, replaced-token detection | Next-token prediction | Span corruption, denoising |
| Output type | Per-token contextual representation | Generated text token by token | Generated text from input sequence |
| Best at | Classification, NER, retrieval, embeddings | Open-ended generation, in-context learning | Translation, summarization, structured generation |
| Can generate text? | No (without modifications) | Yes, autoregressively | Yes, with cross-attention to input |
| Example models | BERT, RoBERTa, ELECTRA, DeBERTa | GPT, LLaMA, Claude, Mistral | T5, BART, mT5, Flan-T5 |
| Typical size (2024-2026) | 50M-1.5B parameters | 1B-1T+ parameters | 100M-30B parameters |
The historical arc went encoder-decoder (Vaswani 2017), then a brief era where encoder-only (BERT) and decoder-only (GPT) split off as separate research directions, and then a striking convergence on decoder-only as the architecture for general-purpose generative LLMs after GPT-3. Encoders did not disappear; they retreated into the roles where they are still clearly best, namely retrieval, classification, and embedding generation. Most modern RAG systems use a decoder-only LLM for generation but a separate encoder-only model (or a contrastive dual encoder) for the retrieval step.
Encoders show up wherever you need a compact, comparable representation of an input rather than a generated output. The most common deployments are semantic search and retrieval (including the retrieval stage of RAG pipelines), text and image classification, deduplication and clustering, recommendation systems, and content moderation.
Encoders have well-known limitations that shape when to reach for them and when not to.
The most fundamental limitation is that a pure encoder cannot generate sequences. BERT can tell you whether a sentence is positive or negative, but it cannot write you a positive sentence. Open-ended generation requires either a decoder, a separate generative head, or an encoder-decoder architecture.
BERT-style encoders also suffer from a pretrain-finetune mismatch caused by the [MASK] token. During pre-training, the encoder sees roughly 12% [MASK] tokens (80% of the 15% masked positions), but at inference time it never sees [MASK] at all. This artificial gap between training and deployment was one of the motivations for ELECTRA's replaced-token-detection objective and for the careful 80/10/10 masking trick in the original BERT recipe.
Fixed-size sentence and image embeddings throw away spatial or sequential structure, which limits their use for tasks that need fine-grained alignment (token-level or pixel-level). Transformer encoders also scale quadratically in sequence length because of self-attention, which makes very long documents (entire books, hours of audio) expensive to encode without efficient-attention tricks.
Dual-encoder retrieval models can be brittle to query styles they did not see in training and tend to underperform cross-encoders (which jointly attend to query and document) on the actual relevance judgment. Production retrieval systems often combine a fast bi-encoder for the first stage with a slower but more accurate cross-encoder reranker for the top results.
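A two-stage pipeline of this kind can be sketched with the SentenceTransformers library; the model names are commonly used public checkpoints, and the three-document corpus is a toy example.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

docs = [
    "Encoders map inputs to vectors.",
    "Paris is the capital of France.",
    "BERT is an encoder-only transformer.",
]
query = "What kind of model is BERT?"

# Stage 1: fast bi-encoder recall by cosine similarity.
doc_emb = bi_encoder.encode(docs, convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, doc_emb, top_k=2)[0]

# Stage 2: cross-encoder jointly attends to each (query, doc) pair
# and rescores the candidates for the final ranking.
pairs = [(query, docs[h["corpus_id"]]) for h in hits]
scores = reranker.predict(pairs)
print(sorted(zip(scores, pairs), reverse=True)[0])
```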
Finally, large pre-trained encoders inherit the biases and gaps of their training data: BERT is overwhelmingly English-centric, CLIP reflects the demographics and stereotypes present in image-caption data scraped from the web, and code encoders absorb the styles and bugs of the open-source corpora they were trained on. These biases can propagate silently into any downstream system built on top of the encoder.
The rise of decoder-only large language models such as GPT-4, Claude, and Gemini might suggest that encoders have become irrelevant. The reality is more interesting. Encoders dominate retrieval, embedding generation, and classification, three categories that turn out to be enormous in practice. They sit at the heart of search engines, RAG pipelines, vector databases, recommendation systems, content moderation, and most of the perception layers of multimodal AI. The training compute spent on contrastive vision-language encoders like CLIP, SigLIP, and DINOv2 in 2023-2025 is comparable to the compute spent on small language models.
Even the largest decoder-only LLMs typically rely on encoder-style components for their input modalities. GPT-4 with vision, Claude's vision support, and Gemini all use a vision encoder (often a ViT trained with contrastive or MAE objectives) to convert images into token-like vectors that the LLM can attend to. Audio modes use encoders descended from Whisper or wav2vec. Tool-using and agentic LLMs route through embedding-based tool retrieval that is essentially an encoder lookup.
In other words, the modern stack is layered: encoders convert raw modalities and large knowledge stores into compact representations, and decoder-only LLMs reason over those representations to produce useful output. Each architecture earns its place by being the best fit for its slice of the problem.
Imagine you have a huge box of LEGO pieces. An encoder is like a friend who looks at all the pieces and draws a quick picture that captures the most important things about what is in the box: the colors, the shapes, the overall idea. That picture is much simpler than the actual pile of LEGO, but it still tells you what you need to know. Later, another friend (the decoder) can look at that picture and try to rebuild the LEGO set, or answer questions about it, or sort it into categories. The encoder's job is to take something big and complicated and turn it into something smaller and easier to work with, while keeping the important details.
Different encoders draw different kinds of pictures. A vision encoder looks at a photo and writes down a list of numbers that captures what the photo is about. A text encoder reads a sentence and writes down a list of numbers that captures what the sentence means. If two photos show similar things, their lists of numbers look similar; if two sentences mean similar things, their lists of numbers look similar too. That is what makes search engines, recommendation systems, and chatbots that can see images all work.