The two-tower model, also known as the dual encoder, bi-encoder, or Siamese network for retrieval, is a neural network architecture that learns to map two related inputs (such as a query and a document, or a user and an item) into a shared low-dimensional vector space. Each input is processed by a separate sub-network, called a tower or encoder, and the relevance between any two inputs is computed by a simple similarity function such as the dot product or cosine similarity between their resulting embedding vectors.
Because the query and item towers are independent at inference, all item embeddings can be precomputed once and indexed in an approximate nearest neighbor (ANN) library such as FAISS, ScaNN, or HNSW. At serving time only the query passes through its tower, and a k-nearest neighbors lookup retrieves the most similar items in milliseconds even from corpora of hundreds of millions of candidates. This decoupling of encoding from scoring is why the two-tower model has become the de facto retrieval and candidate-generation architecture in modern web-scale recommendation and search stacks at Google, YouTube, Pinterest, TikTok, Netflix, Amazon, Meta, and Spotify.
In NLP the same architecture, applied to text passages, underpins the dense passage retrieval family of models that power the retrieval stage of retrieval augmented generation systems used with large language models.
A two-tower model consists of two encoder networks (sometimes sharing low-level lookups but not higher layers), trained jointly so that the inner product between their outputs reflects the semantic relationship between their inputs.
The query tower (also called the user tower or question encoder) maps the query input to a fixed-length vector. In a recommender system the input is typically the user identifier together with contextual features such as device, time of day, recent browsing history, and short-term session signals. In a search system it is the user-issued text. In QA it is the natural language question.
The item tower (also called the candidate, document, or passage encoder) maps an item to a vector of the same dimensionality. The input is a structured representation of the candidate document, video, product, song, advertisement, or knowledge passage, typically combined with metadata such as category, tags, language, freshness, and item identifier embeddings.
Both towers project into a shared embedding space, commonly 64 to 768 dimensions. The output vectors are usually L2 normalized so that the dot product becomes equivalent to cosine similarity.
Given a query embedding q and an item embedding i, the relevance score is computed by a similarity function. The dot product s(q, i) = q · i is by far the most common choice because it is cheap, matches the maximum inner product search problem that ANN libraries optimize for, and aligns with the contrastive learning objectives used during training. Cosine similarity and Euclidean distance can be transformed into equivalent dot products after normalization.
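The following PyTorch sketch illustrates the architecture described above: two independent MLP towers over id-embedding lookups, L2-normalized outputs, and a dot-product score. The layer sizes, inputs, and class names are illustrative, not a reference implementation of any particular production system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """One encoder: an embedding lookup over sparse ids followed by an MLP."""
    def __init__(self, vocab_size, id_dim=32, hidden=256, out_dim=128):
        super().__init__()
        self.id_emb = nn.Embedding(vocab_size, id_dim)
        self.mlp = nn.Sequential(
            nn.Linear(id_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim)
        )

    def forward(self, ids):
        x = self.mlp(self.id_emb(ids))
        return F.normalize(x, dim=-1)  # L2-normalize so dot product = cosine

class TwoTower(nn.Module):
    """Query tower and item tower interact only through the final dot product."""
    def __init__(self, num_queries, num_items, out_dim=128):
        super().__init__()
        self.query_tower = Tower(num_queries, out_dim=out_dim)
        self.item_tower = Tower(num_items, out_dim=out_dim)

    def score(self, query_ids, item_ids):
        q = self.query_tower(query_ids)   # [B, D]
        i = self.item_tower(item_ids)     # [B, D]
        return (q * i).sum(-1)            # one scalar relevance score per pair
```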
The defining constraint of the architecture is that interaction between query and item only happens at this final scalar score: inside the towers the two inputs never see each other. This contrasts with a cross-encoder, where the query and document are concatenated into a single input sequence and every layer can attend across both. The constraint makes the two-tower model fast at serving but is also the source of its accuracy gap relative to cross-encoders.
The internals of each tower can be any neural network appropriate for the input modality: multi-layer perceptrons over sparse embedding lookups (the original DSSM and most large recommender systems), transformer encoders such as BERT, RoBERTa, or T5 for text (SBERT, DPR), vision transformers or CNNs for images (CLIP), and graph neural networks for graph-structured user-item data. When the same weights are used in both towers, the model is called a Siamese network; when towers have separate weights, dual encoder is more precise. In modern usage two-tower, dual encoder, and bi-encoder are typically interchangeable.
The two-tower idea was introduced in its modern form by Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck of Microsoft Research in the 2013 CIKM paper Learning Deep Structured Semantic Models for Web Search using Clickthrough Data. DSSM (Deep Structured Semantic Model) was designed to score the relevance between a web search query and a candidate document for Bing.
DSSM used two parallel feed-forward networks, one for the query and one for the document title, both starting with a word-hashing layer of about thirty thousand letter trigram units. Each tower then projected its vector through several fully connected layers down to a 128-dimensional dense embedding. The cosine similarity between the two embeddings was passed through a softmax over a positive document and a small set of randomly sampled negatives, and the model was trained to maximize the conditional likelihood of the clicked document given the query.
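As a rough illustration of DSSM's word-hashing input layer, the sketch below maps text to a bag of letter trigrams with word-boundary markers; in DSSM this sparse vector (roughly thirty thousand possible trigrams) feeds the first fully connected layer of each tower. The function names are illustrative.

```python
from collections import Counter

def letter_trigrams(word):
    """Letter trigrams with boundary markers, e.g. 'good' -> #go, goo, ood, od#."""
    padded = f"#{word.lower()}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def word_hash(text):
    """Bag-of-letter-trigrams representation of a query or document title."""
    counts = Counter()
    for word in text.split():
        counts.update(letter_trigrams(word))
    return counts

word_hash("deep structured semantic model")
```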
DSSM established three ideas that all subsequent two-tower work has built on: independent encoders for query and document, a similarity function applied only at the top, and contrastive training using clickthrough data as implicit relevance labels.
In 2019, Nils Reimers and Iryna Gurevych at TU Darmstadt published Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks at EMNLP-IJCNLP. The paper showed that BERT, although strong as a cross-encoder, was impractical for semantic similarity search: scoring a query against N sentences requires N full BERT forward passes, and finding the most similar pair in a collection of ten thousand sentences requires on the order of fifty million inferences, or about sixty-five hours.
Reimers and Gurevych wrapped BERT in a Siamese architecture: each sentence is encoded independently by the same BERT model, a pooling layer produces a fixed-length sentence embedding, and the two embeddings are scored by cosine similarity. Trained with a triplet or contrastive loss on natural language inference and semantic textual similarity datasets, SBERT preserved most of BERT's accuracy while reducing the same ten-thousand-sentence search to about five seconds. SBERT made the two-tower paradigm widely accessible and remains the basis of the popular sentence-transformers library.
At RecSys 2019, Xinyang Yi, Ji Yang, Lichan Hong, Derek Cheng, Lukasz Heldt, Aditee Kumthekar, Zhe Zhao, Li Wei, and Ed Chi from Google published Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations, describing how YouTube uses a two-tower model for candidate retrieval.
The paper studied the case where the candidate corpus contains hundreds of millions of items and the training signal is in-batch negatives drawn from a power-law distribution of impressions. The authors showed this causes the model to over-penalize popular items, hurting recall. They proposed a logQ correction in which the logit for each in-batch negative is reduced by the log of its sampling probability, correcting the bias of the sampled softmax estimator. The system was deployed in production for YouTube.
The follow-up paper Mixed Negative Sampling for Learning Two-tower Neural Networks in Recommendations (Yang et al., WWW 2020 Companion) introduced mixed negative sampling (MNS), combining in-batch negatives with negatives drawn uniformly from the corpus, which addresses selection bias because in-batch negatives can never include items that have never been impressed. MNS is now standard in industrial two-tower training.
In 2020, Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih (then at Facebook AI Research and the University of Washington) released dense passage retrieval in the paper Dense Passage Retrieval for Open-Domain Question Answering at EMNLP 2020.
DPR replaced the traditional BM25 retriever in the open-domain QA pipeline with a two-tower BERT model: one BERT encoder produced a 768-dimensional embedding for each Wikipedia passage and a second BERT encoder did the same for each question. The model was trained with a contrastive loss using one positive passage per question, the positive passages of the other questions in the batch as in-batch negatives, and one or two BM25 hard negatives.
On five open-domain QA benchmarks (Natural Questions, TriviaQA, WebQuestions, CuratedTREC, and SQuAD), DPR outperformed BM25 by nine to nineteen absolute percentage points in top-twenty passage retrieval accuracy. The architecture, contrastive training recipe, and BM25 hard negative trick became the template for almost every modern dense retriever, including E5, GTE, BGE, Contriever, and the OpenAI text-embedding family.
OpenAI's CLIP, released in 2021, generalized the two-tower paradigm to a different pair of modalities by training one tower over images and another over text captions on four hundred million image-text pairs from the web with an InfoNCE objective. The resulting shared image-text embedding space supports zero-shot image classification and text-to-image search. Models such as ALIGN, OpenCLIP, SigLIP, and EVA-CLIP have continued the same paradigm at larger scale.
The two-tower model is almost always trained with a contrastive learning objective. The model is shown a positive query-item pair and a set of negative items, and is asked to assign a higher score to the positive pair than to any of the negatives.
The most common training objective is the in-batch sampled softmax, also called the InfoNCE loss. For a mini-batch of N positive query-item pairs, the model treats the N items in the batch as a candidate set for each query. The loss for query q with positive item i is:
L(q, i) = -log(exp(s(q, i) / τ) / Σ exp(s(q, i') / τ))
where the sum runs over all N items in the batch and τ is an optional temperature. This is equivalent to a cross-entropy over an N-way classification problem. In-batch negatives are computationally free and enable very large effective batch sizes (often 1,024 to 16,384). Other common objectives include the pairwise BPR loss, the triplet margin loss popularized by FaceNet and Sentence-BERT, hinge loss, and explicit noise contrastive estimation (NCE).
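A minimal PyTorch sketch of the in-batch sampled softmax is shown below; the temperature value and the normalization step are illustrative choices rather than part of any specific published recipe.

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(query_emb, item_emb, temperature=0.05):
    """InfoNCE over a batch of N positive (query, item) pairs.

    query_emb, item_emb: [N, D] tensors. Row k of item_emb is the positive
    for row k of query_emb; every other row in the batch acts as a negative.
    """
    query_emb = F.normalize(query_emb, dim=-1)
    item_emb = F.normalize(item_emb, dim=-1)

    # [N, N] matrix of scaled similarities: entry (k, j) = s(q_k, i_j) / tau.
    logits = query_emb @ item_emb.T / temperature

    # The positive for query k is on the diagonal, so this reduces to an
    # N-way cross-entropy with target class k for row k.
    targets = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, targets)
```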
The choice of how to construct negatives is one of the most important and most studied aspects of two-tower training. The dominant strategies are in-batch negatives drawn from the other pairs in the mini-batch, negatives sampled uniformly from the corpus, hard negatives mined with BM25 or with an earlier checkpoint of the model, and mixed negative sampling, which combines in-batch and uniform negatives.
When in-batch negatives follow the empirical impression distribution, popular items dominate the negative pool. The logQ correction subtracts each item's log batch-sampling probability from its logit, making the sampled softmax a much closer approximation to the full-corpus softmax. Yi et al. 2019 implemented this with a streaming frequency estimator. The trick is essential when training on real-world traffic logs; without it, popular items are systematically over-penalized and retrieval recall suffers.
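A sketch of how the correction might be applied on top of the in-batch softmax above; here the streaming frequency estimator is abstracted into a precomputed probability per item, which is an assumption of this example rather than a detail taken from the paper.

```python
import torch
import torch.nn.functional as F

def logq_corrected_loss(query_emb, item_emb, item_sampling_prob, temperature=0.05):
    """In-batch softmax with a logQ correction in the spirit of Yi et al. (2019).

    item_sampling_prob: [N] estimated probability that each in-batch item is
    sampled as a negative (e.g. from a streaming estimate of its impression
    frequency). Subtracting its log from every logit keeps frequently seen
    items from being over-penalized.
    """
    query_emb = F.normalize(query_emb, dim=-1)
    item_emb = F.normalize(item_emb, dim=-1)

    logits = query_emb @ item_emb.T / temperature
    logits = logits - torch.log(item_sampling_prob).unsqueeze(0)  # logQ correction

    targets = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, targets)
```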
Large production systems often train in stages: self-supervised pretraining (such as masked language modeling), a contrastive warmup with in-batch negatives, hard negative fine-tuning, and a periodic refresh on fresh interactions.
The practical value of the two-tower model lies in how it splits cleanly between offline and online phases.
Once trained, the item tower is applied once to every item in the corpus to produce a fixed-length embedding. These embeddings are written to an approximate nearest neighbor index, partitioned across many machines for production-scale corpora and rebuilt periodically when new content arrives or the model is retrained. Indexes typically rely on some combination of product quantization (PQ) to compress vectors, inverted file indexing (IVF) to cluster vectors and search only nearby clusters, Hierarchical Navigable Small World graphs (HNSW) for approximately logarithmic graph traversal, and anisotropic vector quantization (ScaNN), which weights quantization error by its contribution to inner-product error.
At query time only the query tower runs, producing a single query vector. The vector is sent to the ANN index, which returns the top-K most similar items in single-digit milliseconds even when K is in the hundreds and the corpus is in the hundreds of millions. These items are then passed to a second-stage ranker (typically a cross-encoder, a deep ranking network with cross-features, or a tree ensemble) for fine scoring.
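A minimal FAISS sketch of this offline/online split, assuming embeddings have already been produced by the two towers; the index type, cluster counts, and corpus size are illustrative.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 128                                    # embedding dimension
item_emb = np.random.randn(100_000, d).astype("float32")  # stand-in for item-tower outputs
faiss.normalize_L2(item_emb)               # unit norm so inner product = cosine

# Offline: build an inverted-file (IVF) index over the item embeddings.
nlist = 1024                               # number of coarse clusters
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(item_emb)
index.add(item_emb)

# Online: run only the query tower, then do a top-K ANN lookup.
query_emb = np.random.randn(1, d).astype("float32")  # stand-in for the query-tower output
faiss.normalize_L2(query_emb)
index.nprobe = 32                          # number of clusters scanned per query
scores, item_ids = index.search(query_emb, 100)
```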
The table below summarizes widely used ANN libraries in two-tower deployments.
| Library | Maintainer | Year | Index families | Strengths |
|---|---|---|---|---|
| FAISS | Meta AI Research | 2017 | IVF, PQ, HNSW, IVF-PQ, OPQ | Mature CPU/GPU support, widely used in research |
| ScaNN | Google Research | 2020 | Anisotropic VQ, asymmetric hashing | State-of-the-art tradeoff for inner-product search |
| HNSWlib | Yury Malkov | 2018 | HNSW | Lightweight C++ header-only, fast in-memory |
| Annoy | Spotify | 2015 | Random projection trees | Simple file format, used historically at Spotify |
| Vespa | Vespa.ai (Yahoo) | 2017 | HNSW, brute force | Search engine with hybrid lexical-dense retrieval |
| Milvus | Zilliz | 2019 | IVF, HNSW, IVF-PQ, DiskANN | Distributed vector database |
| Pinecone | Pinecone | 2021 | IVF and graph hybrids | Managed cloud vector search |
| Qdrant | Qdrant | 2021 | HNSW with payload filters | Open-source vector DB in Rust |
| Weaviate | Weaviate | 2019 | HNSW with hybrid BM25 | Vector DB with built-in object schema |
The key architectural choice in modern neural retrieval is the bi-encoder versus cross-encoder tradeoff. A cross-encoder concatenates the query and a candidate into one input sequence, runs them through a transformer that attends across both, and outputs a relevance score. This enables much richer interaction modeling than the two-tower model, which only interacts via a final dot product, but it requires N forward passes per query for N candidates, which is impractical at retrieval scale.
| Aspect | Two-tower (bi-encoder) | Cross-encoder |
|---|---|---|
| Encoding | Query and item encoded independently | Query and item encoded jointly |
| Item embeddings | Precomputed and stored in ANN index | Cannot be precomputed |
| Scoring per query | Single forward pass + ANN lookup | One forward pass per candidate |
| Latency at corpus size N | O(1) plus sublinear ANN search | O(N) transformer forward passes |
| Accuracy | Strong but not optimal | Highest accuracy of mainstream architectures |
| Use case | Retrieval, candidate generation, semantic search at scale | Reranking small candidate set, classification, pairwise comparison |
| Typical model | DSSM, SBERT, DPR, E5, BGE | monoBERT, cross-encoder/ms-marco-MiniLM-L-6-v2, Cohere Rerank |
| Training data efficiency | Often needs hard negative mining | Usually trains well with random negatives |
| Interaction modeling | Only dot product at the end | Full attention across query and document |
The near-universal solution in production search and QA is a retrieve-then-rerank pipeline. A two-tower retriever fetches the top one hundred to one thousand candidates, then a cross-encoder reranker rescores only those and returns the final top ten or twenty. This brings the candidate set from billions to hundreds in milliseconds, then spends the cross-encoder budget on a small set where its cost is tolerable, capturing most of the cross-encoder's accuracy at near bi-encoder latency.
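A small sketch of such a pipeline using the sentence-transformers library; the model checkpoints and the toy corpus are illustrative choices, and a production system would replace the exhaustive corpus search with an ANN index.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Stage 1 model (bi-encoder) and stage 2 model (cross-encoder); both are examples.
retriever = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = [
    "Paris is the capital of France.",
    "The two-tower model encodes queries and items with separate networks.",
    "FAISS is a library for approximate nearest neighbor search.",
]
# Offline: encode the corpus once with the item tower.
corpus_emb = retriever.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

# Online: encode the query, retrieve candidates, then rerank only those.
query = "Which architecture encodes queries and documents independently?"
query_emb = retriever.encode(query, convert_to_tensor=True, normalize_embeddings=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]

pairs = [(query, corpus[hit["corpus_id"]]) for hit in hits]
rerank_scores = reranker.predict(pairs)   # one cross-encoder forward pass per candidate
reranked = sorted(zip(pairs, rerank_scores), key=lambda x: -x[1])
```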
The two-tower paradigm has been extended in many directions to push past its inherent limitations.
ColBERT, introduced by Omar Khattab and Matei Zaharia in 2020, sits between the two extremes. Each query and document is encoded independently into a bag of token-level embeddings rather than a single pooled vector. At query time, for each query token the maximum similarity over all document tokens is computed (the MaxSim operator), and the per-token maxima are summed into a final score. Because document bags are precomputed, ColBERT keeps the bi-encoder's offline indexing benefits while capturing fine-grained interactions, at the cost of much larger index size. ColBERTv2 added denoised supervision and residual compression to shrink the index toward single-vector retriever sizes.
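The MaxSim operator itself is simple; below is a hedged PyTorch sketch for scoring one precomputed document bag against one query bag (the tensor shapes are illustrative).

```python
import torch

def maxsim_score(query_tokens, doc_tokens):
    """ColBERT-style late interaction score for a single (query, document) pair.

    query_tokens: [Lq, D] token embeddings from the query encoder.
    doc_tokens:   [Ld, D] precomputed token embeddings for one document.
    """
    sim = query_tokens @ doc_tokens.T      # [Lq, Ld] token-to-token similarities
    return sim.max(dim=1).values.sum()     # max over document tokens, summed over query tokens
```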
Matryoshka Representation Learning (MRL), introduced by Aditya Kusupati and colleagues in 2022, modifies training so that the first d coordinates of every output vector are themselves a useful embedding for any d up to the full dimension. The model is trained with a multi-scale loss summing the contrastive loss over nested prefix sizes (for example 64, 128, 256, 512, 768). This allows cheap retrieval at low dimension followed by reranking at full dimension (called adaptive retrieval) and safe truncation. OpenAI's text-embedding-3-small and text-embedding-3-large use Matryoshka training.
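A sketch of the multi-scale objective, reusing the in-batch softmax from above over nested prefixes of the embedding; the prefix sizes and the equal weighting of scales are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def matryoshka_loss(query_emb, item_emb, dims=(64, 128, 256, 512, 768), temperature=0.05):
    """Sum the in-batch contrastive loss over nested prefix dimensions so that
    each prefix of the full embedding is itself a usable embedding."""
    targets = torch.arange(query_emb.size(0), device=query_emb.device)
    total = 0.0
    for d in dims:
        q = F.normalize(query_emb[:, :d], dim=-1)   # truncate, then renormalize
        i = F.normalize(item_emb[:, :d], dim=-1)
        logits = q @ i.T / temperature
        total = total + F.cross_entropy(logits, targets)
    return total / len(dims)
```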
Following LLM scaling laws, researchers have studied how dense retriever quality changes with model size, data size, and embedding dimension. Models such as E5-Mistral-7B, GritLM, NV-Embed, and the SFR-Embedding series use multi-billion parameter LLM backbones as the encoder, either distilled or used directly. These large two-tower models top the MTEB benchmark and have closed much of the gap with cross-encoders, though training and serving costs grow correspondingly.
Many follow-ups relax the single-vector bottleneck. Multi-interest user encoders (such as MIND from Alibaba) emit several user vectors instead of one, capturing different intents. Multi-vector item representations allocate several vectors per item. Graph-based two-tower models propagate information through the user-item interaction graph before the final dot product. Cross-tower interaction layers insert a small amount of attention between the towers' intermediate representations while keeping the final score a dot product, recovering some cross-encoder accuracy without breaking offline indexing.
The two-tower architecture is the dominant retrieval design for almost all large-scale recommender and search systems.
| Organization | Surface | Notes |
|---|---|---|
| Microsoft Bing | Web search | DSSM (2013) and successors for query-document semantic matching |
| Google YouTube | Video recommendations | logQ correction (Yi 2019) and Mixed Negative Sampling (Yang 2020) |
| Google Search | Web and image search | Neural matching and various retrieval components |
| Pins and ads retrieval | Large-scale learned retrieval with HNSW serving | |
| TikTok | Video candidate generation | Feeds a cascade ranking pipeline |
| Meta (Facebook, Instagram) | Feed, ads, search, friends | Embedding-based retrieval (EBR) |
| Netflix | Title recommendations and search | Embedding-based candidate generation |
| Amazon | Product search and recommendations | Two-tower, 3-tower, and 4-tower variants |
| Spotify | Music and podcast recommendations | ANN candidate generation (originally with Annoy) |
| Twitter (X) | Timeline candidate generation | EBR for the For You timeline |
| Job search and people recommendations | Two-tower over professional graphs | |
| Hugging Face SentenceTransformers | Open source library | SBERT and many bi-encoder checkpoints |
The two-tower architecture is used across many applications, including candidate generation in recommender systems, semantic and web search, dense passage retrieval for open-domain question answering and retrieval-augmented generation, ads retrieval, and cross-modal image-text search.
Despite its popularity, the two-tower model has well-known limitations: the single dot product at the top restricts how much query-item interaction it can model, leaving an accuracy gap relative to cross-encoders; quality depends heavily on negative sampling choices; item embeddings go stale between index rebuilds; and compressing a long document or a user's full history into one fixed-length vector is inherently lossy.
The two-tower model sits in a broader family of representation-based retrieval architectures. Latent factor models such as matrix factorization are the simplest case: two embedding lookup tables trained so that the dot product approximates observed ratings. Cross-encoders sit at the opposite end with full attention and no precomputed item representations. Late interaction models like ColBERT live between the extremes. Generative retrieval (DSI, SEAL) skips the embedding step entirely and trains a seq2seq model to emit document identifiers directly. Hybrid retrieval combines a sparse retriever (BM25 or SPLADE) with a dense two-tower retriever via score fusion.