The two-tower model, also known as the dual encoder, bi-encoder, or Siamese network for retrieval, is a neural network architecture that learns to map two related inputs (such as a query and a document, or a user and an item) into a shared low-dimensional vector space. Each input is processed by a separate sub-network, called a tower or encoder, and the relevance between any two inputs is computed by a simple similarity function such as the dot product or cosine similarity between their resulting embedding vectors.
Because the query and item towers are independent at inference, all item embeddings can be precomputed once and indexed in an approximate nearest neighbor (ANN) library such as FAISS, ScaNN, or HNSW. At serving time only the query passes through its tower, and a k-nearest neighbors lookup retrieves the most similar items in milliseconds even from corpora of hundreds of millions of candidates. This decoupling of encoding from scoring is why the two-tower model has become the de facto retrieval and candidate-generation architecture in modern web-scale recommendation and search stacks at Google, YouTube, Pinterest, TikTok, Netflix, Amazon, Meta, and Spotify.
In NLP the same architecture, applied to text passages, underpins the dense passage retrieval family of models that power the retrieval stage of retrieval augmented generation systems used with large language models.
A two-tower model consists of two encoder networks (sometimes sharing low-level lookups but not higher layers), trained jointly so that the inner product between their outputs reflects the semantic relationship between their inputs.
The query tower (also called the user tower or question encoder) maps the query input to a fixed-length vector. In a recommender system the input is typically the user identifier together with contextual features such as device, time of day, recent browsing history, and short-term session signals. In a search system it is the user-issued text. In QA it is the natural language question.
The item tower (also called the candidate, document, or passage encoder) maps an item to a vector of the same dimensionality. The input is a structured representation of the candidate document, video, product, song, advertisement, or knowledge passage, typically combined with metadata such as category, tags, language, freshness, and item identifier embeddings.
Both towers project into a shared embedding space, commonly 64 to 768 dimensions. The output vectors are usually L2 normalized so that the dot product becomes equivalent to cosine similarity.
Given a query embedding q and an item embedding i, the relevance score is computed by a similarity function. The dot product s(q, i) = q · i is by far the most common choice because it is cheap, matches the maximum inner product search problem that ANN libraries optimize for, and aligns with the contrastive learning objectives used during training. Cosine similarity and Euclidean distance can be transformed into equivalent dot products after normalization.
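The following PyTorch sketch illustrates the architecture described above: two independent MLP towers over id-embedding lookups, L2-normalized outputs, and a dot-product score. The layer sizes, inputs, and class names are illustrative, not a reference implementation of any particular production system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """One encoder: an embedding lookup over sparse ids followed by an MLP."""
    def __init__(self, vocab_size, id_dim=32, hidden=256, out_dim=128):
        super().__init__()
        self.id_emb = nn.Embedding(vocab_size, id_dim)
        self.mlp = nn.Sequential(
            nn.Linear(id_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim)
        )

    def forward(self, ids):
        x = self.mlp(self.id_emb(ids))
        return F.normalize(x, dim=-1)  # L2-normalize so dot product = cosine

class TwoTower(nn.Module):
    """Query tower and item tower interact only through the final dot product."""
    def __init__(self, num_queries, num_items, out_dim=128):
        super().__init__()
        self.query_tower = Tower(num_queries, out_dim=out_dim)
        self.item_tower = Tower(num_items, out_dim=out_dim)

    def score(self, query_ids, item_ids):
        q = self.query_tower(query_ids)   # [B, D]
        i = self.item_tower(item_ids)     # [B, D]
        return (q * i).sum(-1)            # one scalar relevance score per pair
```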
The defining constraint of the architecture is that interaction between query and item only happens at this final scalar score: inside the towers the two inputs never see each other. This contrasts with a cross-encoder, where the query and document are concatenated into a single input sequence and every layer can attend across both. The constraint makes the two-tower model fast at serving but is also the source of its accuracy gap relative to cross-encoders.
The internals of each tower can be any neural network appropriate for the input modality: multi-layer perceptrons over sparse embedding lookups (the original DSSM and most large recommender systems), transformer encoders such as BERT, RoBERTa, or T5 for text (SBERT, DPR), vision transformers or CNNs for images (CLIP), and graph neural networks for graph-structured user-item data. When the same weights are used in both towers, the model is called a Siamese network; when towers have separate weights, dual encoder is more precise. In modern usage two-tower, dual encoder, and bi-encoder are typically interchangeable.
The two-tower idea was introduced in its modern form by Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck of Microsoft Research in the 2013 CIKM paper Learning Deep Structured Semantic Models for Web Search using Clickthrough Data. DSSM (Deep Structured Semantic Model) was designed to score the relevance between a web search query and a candidate document for Bing.
DSSM used two parallel feed-forward networks, one for the query and one for the document title, both starting with a word-hashing layer of about thirty thousand letter trigram units. Each tower then projected its vector through several fully connected layers down to a 128-dimensional dense embedding. The cosine similarity between the two embeddings was passed through a softmax over a positive document and a small set of randomly sampled negatives, and the model was trained to maximize the conditional likelihood of the clicked document given the query.
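As a rough illustration of DSSM's word-hashing input layer, the sketch below maps text to a bag of letter trigrams with word-boundary markers; in DSSM this sparse vector (roughly thirty thousand possible trigrams) feeds the first fully connected layer of each tower. The function names are illustrative.

```python
from collections import Counter

def letter_trigrams(word):
    """Letter trigrams with boundary markers, e.g. 'good' -> #go, goo, ood, od#."""
    padded = f"#{word.lower()}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def word_hash(text):
    """Bag-of-letter-trigrams representation of a query or document title."""
    counts = Counter()
    for word in text.split():
        counts.update(letter_trigrams(word))
    return counts

word_hash("deep structured semantic model")
```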
DSSM established three ideas that all subsequent two-tower work has built on: independent encoders for query and document, a similarity function applied only at the top, and contrastive training using clickthrough data as implicit relevance labels.
In 2019, Nils Reimers and Iryna Gurevych at TU Darmstadt published Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks at EMNLP-IJCNLP. The paper showed that BERT, although strong as a cross-encoder, was impractical for semantic similarity search: scoring a query against N sentences requires N full BERT forward passes, and finding the most similar pair in a collection of ten thousand sentences requires on the order of fifty million inferences, or about sixty-five hours.
Reimers and Gurevych wrapped BERT in a Siamese architecture: each sentence is encoded independently by the same BERT model, a pooling layer produces a fixed-length sentence embedding, and the two embeddings are scored by cosine similarity. Trained with a triplet or contrastive loss on natural language inference and semantic textual similarity datasets, SBERT preserved most of BERT's accuracy while reducing the same ten-thousand-sentence search to about five seconds. SBERT made the two-tower paradigm widely accessible and remains the basis of the popular sentence-transformers library.
At RecSys 2019, Xinyang Yi, Ji Yang, Lichan Hong, Derek Cheng, Lukasz Heldt, Aditee Kumthekar, Zhe Zhao, Li Wei, and Ed Chi from Google published Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations, describing how YouTube uses a two-tower model for candidate retrieval.
The paper studied the case where the candidate corpus contains hundreds of millions of items and the training signal is in-batch negatives drawn from a power-law distribution of impressions. The authors showed this causes the model to over-penalize popular items, hurting recall. They proposed a logQ correction in which the logit for each in-batch negative is reduced by the log of its sampling probability, correcting the bias of the sampled softmax estimator. The system was deployed in production for YouTube.
The follow-up paper Mixed Negative Sampling for Learning Two-tower Neural Networks in Recommendations (Yang et al., WWW 2020 Companion) introduced mixed negative sampling (MNS), combining in-batch negatives with negatives drawn uniformly from the corpus, which addresses selection bias because in-batch negatives can never include items that have never been impressed. MNS is now standard in industrial two-tower training.
In 2020, Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih (then at Facebook AI Research and the University of Washington) released dense passage retrieval in the paper Dense Passage Retrieval for Open-Domain Question Answering at EMNLP 2020.
DPR replaced the traditional BM25 retriever in the open-domain QA pipeline with a two-tower BERT model: one BERT encoder produced a 768-dimensional embedding for each Wikipedia passage and a second BERT encoder did the same for each question. The model was trained with a contrastive loss using one positive passage per question, the positive passages of the other questions in the batch as in-batch negatives, and one or two BM25 hard negatives.
On five open-domain QA benchmarks (Natural Questions, TriviaQA, WebQuestions, CuratedTREC, and SQuAD), DPR outperformed BM25 by nine to nineteen absolute percentage points in top-twenty passage retrieval accuracy. The architecture, contrastive training recipe, and BM25 hard negative trick became the template for almost every modern dense retriever, including E5, GTE, BGE, Contriever, and the OpenAI text-embedding family.
OpenAI's CLIP, released in 2021, generalized the two-tower paradigm to a different pair of modalities by training one tower over images and another over text captions on four hundred million image-text pairs from the web with an InfoNCE objective. The resulting shared image-text embedding space supports zero-shot image classification and text-to-image search. Models such as ALIGN, OpenCLIP, SigLIP, and EVA-CLIP have continued the same paradigm at larger scale.
The two-tower model is almost always trained with a contrastive learning objective. The model is shown a positive query-item pair and a set of negative items, and is asked to assign a higher score to the positive pair than to any of the negatives.
The most common training objective is the in-batch sampled softmax, also called the InfoNCE loss. For a mini-batch of N positive query-item pairs, the model treats the N items in the batch as a candidate set for each query. The loss for query q with positive item i is:
L(q, i) = -log(exp(s(q, i) / τ) / Σ exp(s(q, i') / τ))
where the sum runs over all N items in the batch and τ is an optional temperature. This is equivalent to a cross-entropy over an N-way classification problem. In-batch negatives are computationally free and enable very large effective batch sizes (often 1,024 to 16,384). Other common objectives include the pairwise BPR loss, the triplet margin loss popularized by FaceNet and Sentence-BERT, hinge loss, and explicit noise contrastive estimation (NCE).
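A minimal PyTorch sketch of the in-batch sampled softmax is shown below; the temperature value and the normalization step are illustrative choices rather than part of any specific published recipe.

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(query_emb, item_emb, temperature=0.05):
    """InfoNCE over a batch of N positive (query, item) pairs.

    query_emb, item_emb: [N, D] tensors. Row k of item_emb is the positive
    for row k of query_emb; every other row in the batch acts as a negative.
    """
    query_emb = F.normalize(query_emb, dim=-1)
    item_emb = F.normalize(item_emb, dim=-1)

    # [N, N] matrix of scaled similarities: entry (k, j) = s(q_k, i_j) / tau.
    logits = query_emb @ item_emb.T / temperature

    # The positive for query k is on the diagonal, so this reduces to an
    # N-way cross-entropy with target class k for row k.
    targets = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, targets)
```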
The choice of how to construct negatives is one of the most important and most studied aspects of two-tower training. The dominant strategies are in-batch negatives drawn from the other pairs in the mini-batch, negatives sampled uniformly from the corpus, hard negatives mined with BM25 or with an earlier checkpoint of the model, and mixed negative sampling, which combines in-batch and uniform negatives.
When in-batch negatives follow the empirical impression distribution, popular items dominate the negative pool. The logQ correction subtracts each item's log batch-sampling probability from its logit, making the sampled softmax a much closer approximation to the full-corpus softmax. Yi et al. 2019 implemented this with a streaming frequency estimator. The trick is essential when training on real-world traffic logs; without it, popular items are systematically over-penalized and retrieval recall suffers.
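A sketch of how the correction might be applied on top of the in-batch softmax above; here the streaming frequency estimator is abstracted into a precomputed probability per item, which is an assumption of this example rather than a detail taken from the paper.

```python
import torch
import torch.nn.functional as F

def logq_corrected_loss(query_emb, item_emb, item_sampling_prob, temperature=0.05):
    """In-batch softmax with a logQ correction in the spirit of Yi et al. (2019).

    item_sampling_prob: [N] estimated probability that each in-batch item is
    sampled as a negative (e.g. from a streaming estimate of its impression
    frequency). Subtracting its log from every logit keeps frequently seen
    items from being over-penalized.
    """
    query_emb = F.normalize(query_emb, dim=-1)
    item_emb = F.normalize(item_emb, dim=-1)

    logits = query_emb @ item_emb.T / temperature
    logits = logits - torch.log(item_sampling_prob).unsqueeze(0)  # logQ correction

    targets = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, targets)
```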
Large production systems often train in stages: self-supervised pretraining (such as masked language modeling), a contrastive warmup with in-batch negatives, hard negative fine-tuning, and a periodic refresh on fresh interactions.
The practical value of the two-tower model lies in how it splits cleanly between offline and online phases.
Once trained, the item tower is applied once to every item in the corpus to produce a fixed-length embedding. These embeddings are written to an approximate nearest neighbor index, partitioned across many machines for production-scale corpora and rebuilt periodically when new content arrives or the model is retrained. Indexes typically rely on some combination of product quantization (PQ) to compress vectors, inverted file indexing (IVF) to cluster vectors and search only nearby clusters, Hierarchical Navigable Small World graphs (HNSW) for approximately logarithmic graph traversal, and anisotropic vector quantization (ScaNN), which weights quantization error by its contribution to inner-product error.
At query time only the query tower runs, producing a single query vector. The vector is sent to the ANN index, which returns the top-K most similar items in single-digit milliseconds even when K is in the hundreds and the corpus is in the hundreds of millions. These items are then passed to a second-stage ranker (typically a cross-encoder, a deep ranking network with cross-features, or a tree ensemble) for fine scoring.
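A minimal FAISS sketch of this offline/online split, assuming embeddings have already been produced by the two towers; the index type, cluster counts, and corpus size are illustrative.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 128                                    # embedding dimension
item_emb = np.random.randn(100_000, d).astype("float32")  # stand-in for item-tower outputs
faiss.normalize_L2(item_emb)               # unit norm so inner product = cosine

# Offline: build an inverted-file (IVF) index over the item embeddings.
nlist = 1024                               # number of coarse clusters
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(item_emb)
index.add(item_emb)

# Online: run only the query tower, then do a top-K ANN lookup.
query_emb = np.random.randn(1, d).astype("float32")  # stand-in for the query-tower output
faiss.normalize_L2(query_emb)
index.nprobe = 32                          # number of clusters scanned per query
scores, item_ids = index.search(query_emb, 100)
```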
The table below summarizes widely used ANN libraries in two-tower deployments.
| Library | Maintainer | Year | Index families | Strengths |
|---|---|---|---|---|
| FAISS | Meta AI Research | 2017 | IVF, PQ, HNSW, IVF-PQ, OPQ | Mature CPU/GPU support, widely used in research |
| ScaNN | Google Research | 2020 | Anisotropic VQ, asymmetric hashing | State-of-the-art tradeoff for inner-product search |
| HNSWlib | Yury Malkov | 2018 | HNSW | Lightweight C++ header-only, fast in-memory |
| Annoy | Spotify | 2015 | Random projection trees | Simple file format, used historically at Spotify |
| Vespa | Vespa.ai (Yahoo) | 2017 | HNSW, brute force | Search engine with hybrid lexical-dense retrieval |
| Milvus | Zilliz | 2019 | IVF, HNSW, IVF-PQ, DiskANN | Distributed vector database |
| Pinecone | Pinecone | 2021 | IVF and graph hybrids | Managed cloud vector search |
| Qdrant | Qdrant | 2021 | HNSW with payload filters | Open-source vector DB in Rust |
| Weaviate | Weaviate | 2019 | HNSW with hybrid BM25 | Vector DB with built-in object schema |
The key architectural choice in modern neural retrieval is the bi-encoder versus cross-encoder tradeoff. A cross-encoder concatenates the query and a candidate into one input sequence, runs them through a transformer that attends across both, and outputs a relevance score. This enables much richer interaction modeling than the two-tower model, which only interacts via a final dot product, but it requires N forward passes per query for N candidates, which is impractical at retrieval scale.
| Aspect | Two-tower (bi-encoder) | Cross-encoder |
|---|---|---|
| Encoding | Query and item encoded independently | Query and item encoded jointly |
| Item embeddings | Precomputed and stored in ANN index | Cannot be precomputed |
| Scoring per query | Single forward pass + ANN lookup | One forward pass per candidate |
| Latency at corpus size N | O(1) plus sublinear ANN search | O(N) transformer forward passes |
| Accuracy | Strong but not optimal | Highest accuracy of mainstream architectures |
| Use case | Retrieval, candidate generation, semantic search at scale | Reranking small candidate set, classification, pairwise comparison |
| Typical model | DSSM, SBERT, DPR, E5, BGE | monoBERT, cross-encoder/ms-marco-MiniLM-L-6-v2, Cohere Rerank |
| Training data efficiency | Often needs hard negative mining | Usually trains well with random negatives |
| Interaction modeling | Only dot product at the end | Full attention across query and document |
The near-universal solution in production search and QA is a retrieve-then-rerank pipeline. A two-tower retriever fetches the top one hundred to one thousand candidates, then a cross-encoder reranker rescores only those and returns the final top ten or twenty. This brings the candidate set from billions to hundreds in milliseconds, then spends the cross-encoder budget on a small set where its cost is tolerable, capturing most of the cross-encoder's accuracy at near bi-encoder latency.
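A small sketch of such a pipeline using the sentence-transformers library; the model checkpoints and the toy corpus are illustrative choices, and a production system would replace the exhaustive corpus search with an ANN index.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Stage 1 model (bi-encoder) and stage 2 model (cross-encoder); both are examples.
retriever = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = [
    "Paris is the capital of France.",
    "The two-tower model encodes queries and items with separate networks.",
    "FAISS is a library for approximate nearest neighbor search.",
]
# Offline: encode the corpus once with the item tower.
corpus_emb = retriever.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

# Online: encode the query, retrieve candidates, then rerank only those.
query = "Which architecture encodes queries and documents independently?"
query_emb = retriever.encode(query, convert_to_tensor=True, normalize_embeddings=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]

pairs = [(query, corpus[hit["corpus_id"]]) for hit in hits]
rerank_scores = reranker.predict(pairs)   # one cross-encoder forward pass per candidate
reranked = sorted(zip(pairs, rerank_scores), key=lambda x: -x[1])
```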
The two-tower paradigm has been extended in many directions to push past its inherent limitations.
ColBERT, introduced by Omar Khattab and Matei Zaharia in 2020, sits between the two extremes. Each query and document is encoded independently into a bag of token-level embeddings rather than a single pooled vector. At query time, for each query token the maximum similarity over all document tokens is computed (the MaxSim operator), and the per-token maxima are summed into a final score. Because document bags are precomputed, ColBERT keeps the bi-encoder's offline indexing benefits while capturing fine-grained interactions, at the cost of much larger index size. ColBERTv2 added denoised supervision and residual compression to shrink the index toward single-vector retriever sizes.
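The MaxSim operator itself is simple; below is a hedged PyTorch sketch for scoring one precomputed document bag against one query bag (the tensor shapes are illustrative).

```python
import torch

def maxsim_score(query_tokens, doc_tokens):
    """ColBERT-style late interaction score for a single (query, document) pair.

    query_tokens: [Lq, D] token embeddings from the query encoder.
    doc_tokens:   [Ld, D] precomputed token embeddings for one document.
    """
    sim = query_tokens @ doc_tokens.T      # [Lq, Ld] token-to-token similarities
    return sim.max(dim=1).values.sum()     # max over document tokens, summed over query tokens
```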
Matryoshka Representation Learning (MRL), introduced by Aditya Kusupati and colleagues in 2022, modifies training so that the first d coordinates of every output vector are themselves a useful embedding for any d up to the full dimension. The model is trained with a multi-scale loss summing the contrastive loss over nested prefix sizes (for example 64, 128, 256, 512, 768). This allows cheap retrieval at low dimension followed by reranking at full dimension (called adaptive retrieval) and safe truncation. OpenAI's text-embedding-3-small and text-embedding-3-large use Matryoshka training.
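A sketch of the multi-scale objective, reusing the in-batch softmax from above over nested prefixes of the embedding; the prefix sizes and the equal weighting of scales are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def matryoshka_loss(query_emb, item_emb, dims=(64, 128, 256, 512, 768), temperature=0.05):
    """Sum the in-batch contrastive loss over nested prefix dimensions so that
    each prefix of the full embedding is itself a usable embedding."""
    targets = torch.arange(query_emb.size(0), device=query_emb.device)
    total = 0.0
    for d in dims:
        q = F.normalize(query_emb[:, :d], dim=-1)   # truncate, then renormalize
        i = F.normalize(item_emb[:, :d], dim=-1)
        logits = q @ i.T / temperature
        total = total + F.cross_entropy(logits, targets)
    return total / len(dims)
```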
Following LLM scaling laws, researchers have studied how dense retriever quality changes with model size, data size, and embedding dimension. Models such as E5-Mistral-7B, GritLM, NV-Embed, and the SFR-Embedding series use multi-billion parameter LLM backbones as the encoder, either distilled or used directly. These large two-tower models top the MTEB benchmark and have closed much of the gap with cross-encoders, though training and serving costs grow correspondingly.
Many follow-ups relax the single-vector bottleneck. Multi-interest user encoders (such as MIND from Alibaba) emit several user vectors instead of one, capturing different intents. Multi-vector item representations allocate several vectors per item. Graph-based two-tower models propagate information through the user-item interaction graph before the final dot product. Cross-tower interaction layers insert a small amount of attention between the towers' intermediate representations while keeping the final score a dot product, recovering some cross-encoder accuracy without breaking offline indexing.
The two-tower architecture is the dominant retrieval design for almost all large-scale recommender and search systems.
| Organization | Surface | Notes |
|---|---|---|
| Microsoft Bing | Web search | DSSM (2013) and successors for query-document semantic matching |
| Google YouTube | Video recommendations | logQ correction (Yi 2019) and Mixed Negative Sampling (Yang 2020) |
| Google Search | Web and image search | Neural matching and various retrieval components |
| Pins and ads retrieval | Large-scale learned retrieval with HNSW serving | |
| TikTok | Video candidate generation | Feeds a cascade ranking pipeline |
| Meta (Facebook, Instagram) | Feed, ads, search, friends | Embedding-based retrieval (EBR) |
| Netflix | Title recommendations and search | Embedding-based candidate generation |
| Amazon | Product search and recommendations | Two-tower, 3-tower, and 4-tower variants |
| Spotify | Music and podcast recommendations | ANN candidate generation (originally with Annoy) |
| Twitter (X) | Timeline candidate generation | EBR for the For You timeline |
| Job search and people recommendations | Two-tower over professional graphs | |
| Hugging Face SentenceTransformers | Open source library | SBERT and many bi-encoder checkpoints |
The two-tower architecture is used across many applications, including candidate generation in recommender systems, semantic and web search, dense passage retrieval for open-domain question answering and retrieval-augmented generation, ads retrieval, and cross-modal image-text search.
Despite its popularity, the two-tower model has well-known limitations: the single dot product at the top restricts how much query-item interaction it can model, leaving an accuracy gap relative to cross-encoders; quality depends heavily on negative sampling choices; item embeddings go stale between index rebuilds; and compressing a long document or a user's full history into one fixed-length vector is inherently lossy.
The two-tower model sits in a broader family of representation-based retrieval architectures. Latent factor models such as matrix factorization are the simplest case: two embedding lookup tables trained so that the dot product approximates observed ratings. Cross-encoders sit at the opposite end with full attention and no precomputed item representations. Late interaction models like ColBERT live between the extremes. Generative retrieval (DSI, SEAL) skips the embedding step entirely and trains a seq2seq model to emit document identifiers directly. Hybrid retrieval combines a sparse retriever (BM25 or SPLADE) with a dense two-tower retriever via score fusion.