# Two-Tower Model

> Source: https://aiwiki.ai/wiki/two-tower_model
> Updated: 2026-06-23
> Categories: Information Retrieval, Machine Learning, Neural Networks
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

The **two-tower model**, also known as the **dual encoder**, **bi-encoder**, or **Siamese network for retrieval**, is a [neural network](/wiki/neural_network) architecture that encodes a query and a candidate item with two separate sub-networks (towers) into a shared vector space, then measures their relevance with a single dot product or cosine similarity between the resulting [embeddings](/wiki/vector_embeddings). Because the two towers never interact until that final score, item embeddings can be computed once, indexed for approximate nearest neighbor (ANN) search, and retrieved in milliseconds, which is why the two-tower model is the de facto candidate-generation architecture in web-scale [recommender system](/wiki/recommender_system) and [semantic search](/wiki/semantic_search) stacks at Google, YouTube, Pinterest, TikTok, Netflix, Amazon, Meta, and Spotify.

In practice, the precomputed item embeddings are stored in an ANN index built with a library such as [FAISS](/wiki/faiss), ScaNN, or hnswlib (an HNSW implementation). At serving time only the query passes through its tower, and a [k-nearest neighbors](/wiki/k_nearest_neighbors) lookup retrieves the most similar items in milliseconds, even from corpora of hundreds of millions of candidates. This decoupling of encoding from scoring is the source of the architecture's scalability.

In NLP the same architecture, applied to text passages, underpins the [dense passage retrieval](/wiki/dense_passage_retrieval) family of models that power the retrieval stage of [retrieval augmented generation](/wiki/retrieval_augmented_generation) systems used with large language models.[5]

## What problem does the two-tower model solve?

A two-tower model consists of two encoder networks (sometimes sharing low-level lookups but not higher layers), trained jointly so that the inner product between their outputs reflects the semantic relationship between their inputs. The core problem it solves is search at scale: scoring a query against billions of candidates is intractable if every candidate requires a fresh neural forward pass, so the two-tower design factors the model into a part that depends only on the query and a part that depends only on the item, allowing the item side to be fully precomputed.

### The two towers

The **query tower** (also called the user tower or question encoder) maps the query input to a fixed-length vector. In a recommender system the input is typically the user identifier together with contextual features such as device, time of day, recent browsing history, and short-term session signals. In a search system it is the user-issued text. In QA it is the natural language question.

The **item tower** (also called the candidate, document, or passage encoder) maps an item to a vector of the same dimensionality. The input is a structured representation of the candidate document, video, product, song, advertisement, or knowledge passage, typically combined with metadata such as category, tags, language, freshness, and item identifier embeddings.

Both towers project into a shared embedding space, commonly 64 to 768 dimensions. The output vectors are usually L2 normalized so that the dot product becomes equivalent to cosine similarity.
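
As a minimal sketch of the structure described above, the following NumPy code builds two small, independent MLP towers that project into a shared 64-dimensional space and L2-normalizes their outputs. The feature sizes, layer widths, and random weights are illustrative assumptions, not values from any cited system.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(in_dim, hidden_dim, out_dim):
    """Random weights for a two-layer MLP (illustrative initialisation only)."""
    return {
        "w1": rng.normal(0.0, 0.1, (in_dim, hidden_dim)),
        "w2": rng.normal(0.0, 0.1, (hidden_dim, out_dim)),
    }

def tower(params, x):
    """Map a batch of feature vectors to L2-normalized embeddings."""
    h = np.maximum(x @ params["w1"], 0.0)           # ReLU hidden layer
    z = h @ params["w2"]                            # projection into the shared space
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    return z / np.maximum(norms, 1e-12)             # guard against a zero vector

# Hypothetical sizes: 20 query features, 35 item features, 64-d shared space.
query_tower = init_mlp(20, 32, 64)
item_tower = init_mlp(35, 32, 64)

queries = rng.normal(size=(4, 20))
items = rng.normal(size=(6, 35))

q_emb = tower(query_tower, queries)   # shape (4, 64)
i_emb = tower(item_tower, items)      # shape (6, 64)
scores = q_emb @ i_emb.T              # shape (4, 6); each entry is a cosine similarity
```

The two towers accept inputs of different sizes and share no weights; the only thing they have in common is the output dimensionality.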

### Scoring function

Given a query embedding q and an item embedding i, the relevance score is computed by a similarity function. The dot product s(q, i) = q · i is by far the most common choice because it is cheap, matches the maximum inner product search problem that ANN libraries optimize for, and aligns with the contrastive learning objectives used during training. When both vectors are L2 normalized, cosine similarity equals the dot product and squared Euclidean distance equals 2 - 2 (q · i), so all three induce the same ranking.
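
The relationship between these similarity functions can be checked numerically. For unit-length vectors, the squared Euclidean distance is 2 - 2 (q · i), so ranking by largest dot product and ranking by smallest distance give the same order. The sketch below uses random vectors purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

q = l2_normalize(rng.normal(size=8))
items = l2_normalize(rng.normal(size=(50, 8)))

dot = items @ q                               # equals cosine similarity here
sq_dist = np.sum((items - q) ** 2, axis=1)    # squared Euclidean distance

rank_by_dot = np.argsort(-dot)                # most similar first
rank_by_dist = np.argsort(sq_dist)            # closest first
```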

The defining constraint of the architecture is that interaction between query and item only happens at this final scalar score: inside the towers the two inputs never see each other. This contrasts with a cross-encoder, where the query and document are concatenated into a single input sequence and every layer can attend across both. The constraint makes the two-tower model fast at serving but is also the source of its accuracy gap relative to cross-encoders.

### Tower implementations

The internals of each tower can be any neural network appropriate for the input modality: multi-layer perceptrons over sparse embedding lookups (the original DSSM and most large recommender systems), [transformer](/wiki/transformer) encoders such as BERT, RoBERTa, or T5 for text (SBERT, DPR), vision transformers or CNNs for images (CLIP), and graph neural networks for graph-structured user-item data. When the same weights are used in both towers, the model is called a **Siamese** network; when towers have separate weights, **dual encoder** is more precise. In modern usage two-tower, dual encoder, and bi-encoder are typically interchangeable.

## History and key papers

### DSSM (2013)

The two-tower idea was introduced in its modern form by Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck of Microsoft Research in the 2013 CIKM paper *Learning Deep Structured Semantic Models for Web Search using Clickthrough Data*. DSSM (Deep Structured Semantic Model) was designed to score the relevance between a web search query and a candidate document for Bing.[1]

DSSM used two parallel feed-forward networks, one for the query and one for the document title, both starting with a word-hashing layer of about thirty thousand letter-trigram units. The paper reports that this letter n-gram word hashing reduced input dimensionality about 16-fold with a collision rate of only 0.0044%.[1] Each tower then projected its vector through several fully connected layers down to a 128-dimensional dense embedding. The cosine similarity between the two embeddings was passed through a softmax over a positive document and a small set of randomly sampled negatives, and the model was trained to maximize the conditional likelihood of the clicked document given the query.[1]
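
The word-hashing step can be illustrated with a short sketch. Each word is wrapped in boundary marks and split into overlapping letter trigrams, and a query or title becomes a bag of those trigrams. The function names and the lowercasing and whitespace tokenization are assumptions made for this example, not details taken from the DSSM implementation.

```python
from collections import Counter

def letter_trigrams(word):
    """Return the letter trigrams of a word after adding '#' boundary marks."""
    marked = f"#{word}#"
    return [marked[k:k + 3] for k in range(len(marked) - 2)]

def trigram_bag(text):
    """Sparse bag-of-trigrams counts for a query or document title."""
    bag = Counter()
    for word in text.lower().split():
        bag.update(letter_trigrams(word))
    return bag

print(letter_trigrams("good"))   # ['#go', 'goo', 'ood', 'od#']
```

Because the number of distinct letter trigrams is far smaller than the number of distinct words, the input layer stays compact even for an open vocabulary.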

DSSM established three ideas that all subsequent two-tower work has built on: independent encoders for query and document, a similarity function applied only at the top, and contrastive training using clickthrough data as implicit relevance labels.[1]

### Sentence-BERT (2019)

In 2019, Nils Reimers and Iryna Gurevych at TU Darmstadt published *Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks* at EMNLP-IJCNLP. The paper showed that BERT, although strong as a cross-encoder, was impractical for semantic similarity search because finding the most similar pair among ten thousand sentences requires about 50 million inference computations, roughly 65 hours of BERT compute.[2]

Reimers and Gurevych wrapped BERT in a Siamese architecture: each sentence is encoded independently by the same BERT model, a pooling layer produces a fixed-length sentence embedding, and the two embeddings are scored by cosine similarity. Trained with classification, regression, or triplet objectives on natural language inference and semantic textual similarity datasets, SBERT preserved most of BERT's accuracy while reducing the same ten-thousand-sentence search from about 65 hours to roughly 5 seconds.[2] SBERT made the two-tower paradigm widely accessible and remains the basis of the popular `sentence-transformers` library.
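
The "about 50 million" figure follows from counting unordered sentence pairs. A cross-encoder needs one forward pass per pair, whereas a bi-encoder needs one pass per sentence followed by cheap vector comparisons. The arithmetic is:

```python
from math import comb

n_sentences = 10_000

# A cross-encoder scores every unordered pair with its own forward pass.
cross_encoder_passes = comb(n_sentences, 2)   # n * (n - 1) / 2

# A bi-encoder encodes each sentence once; pair scores are then dot products.
bi_encoder_passes = n_sentences

print(cross_encoder_passes)   # 49995000
print(bi_encoder_passes)      # 10000
```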

### YouTube two-tower retrieval (2019)

At RecSys 2019, Xinyang Yi, Ji Yang, Lichan Hong, Derek Cheng, Lukasz Heldt, Aditee Kumthekar, Zhe Zhao, Li Wei, and Ed Chi from Google published *Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations*, describing how YouTube uses a two-tower model to retrieve personalized candidates from a corpus of tens of millions of videos.[3]

The paper observed that training with in-batch negatives drawn from a power-law distribution of impressions introduces a sampling bias, noting that "in-batch items are usually sampled from a power-law distribution in our case."[3] This causes the model to over-penalize popular items and hurts recall. The authors proposed a *logQ correction* in which the logit for each in-batch negative is reduced by the log of its sampling probability, restoring an unbiased sampled softmax estimator, paired with a streaming algorithm that estimates item frequency without a fixed vocabulary. Live A/B experiments on YouTube reported a +0.37% engagement gain from the bias correction, and the neural retrieval system was deployed in production.[3]
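
One way to estimate item frequency from a stream, in the spirit of the approach described above, is to track for each item a moving average of the number of training steps between consecutive sightings; the sampling probability is then roughly the reciprocal of that gap. The sketch below is an assumption-laden simplification: it uses Python dictionaries keyed by item ID instead of fixed-size hashed arrays, and the class name, `alpha` value, and item names are hypothetical.

```python
class StreamingFrequencyEstimator:
    """Estimate per-item sampling probability from the gaps between sightings."""

    def __init__(self, alpha=0.05):
        self.alpha = alpha      # learning rate of the moving average
        self.last_step = {}     # last step at which each item was seen
        self.avg_gap = {}       # moving average of steps between sightings

    def update(self, item_id, step):
        gap = step - self.last_step.get(item_id, 0)
        prev = self.avg_gap.get(item_id, 0.0)
        self.avg_gap[item_id] = (1 - self.alpha) * prev + self.alpha * gap
        self.last_step[item_id] = step

    def probability(self, item_id):
        """Estimated probability that the item appears in a given step."""
        return 1.0 / self.avg_gap[item_id]

est = StreamingFrequencyEstimator(alpha=0.05)
for step in range(1, 2001):
    if step % 2 == 0:
        est.update("popular_video", step)   # seen every 2 steps
    if step % 50 == 0:
        est.update("rare_video", step)      # seen every 50 steps

p_popular = est.probability("popular_video")
p_rare = est.probability("rare_video")
```

In this toy stream the popular item's estimate settles near 0.5, while the rare item's estimate is still converging toward 0.02 after only 40 sightings, which shows how the moving average trades responsiveness against accuracy.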

The follow-up paper *Mixed Negative Sampling for Learning Two-tower Neural Networks in Recommendations* (Yang et al., WWW 2020 Companion) introduced **mixed negative sampling** (MNS), combining in-batch negatives with negatives drawn uniformly from the corpus, which addresses selection bias because in-batch negatives can never include items that have never been impressed.[4] MNS is now standard in industrial two-tower training.
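
A minimal sketch of how a mixed negative batch might be assembled: the in-batch positive items are concatenated with items drawn uniformly from the corpus, and each query is scored against the combined candidate set. The function name, batch sizes, and random embeddings are illustrative assumptions; this sketch also does not mask the rare case where a uniform sample coincides with a query's own positive.

```python
import numpy as np

def mixed_negative_logits(q_emb, pos_item_emb, corpus_emb, num_uniform, rng):
    """Score each query against in-batch items plus uniformly sampled items."""
    uniform_idx = rng.choice(len(corpus_emb), size=num_uniform, replace=False)
    candidates = np.concatenate([pos_item_emb, corpus_emb[uniform_idx]], axis=0)
    logits = q_emb @ candidates.T            # shape (B, B + num_uniform)
    labels = np.arange(len(q_emb))           # the positive for query k is column k
    return logits, labels

rng = np.random.default_rng(2)
corpus = rng.normal(size=(1000, 16))         # stand-in item embeddings
batch_items = corpus[:8]                     # positives for a batch of 8 queries
batch_queries = rng.normal(size=(8, 16))

logits, labels = mixed_negative_logits(batch_queries, batch_items, corpus, 32, rng)
```

The uniform samples give never-impressed items a chance to appear as negatives, which in-batch sampling alone cannot do.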

### Dense Passage Retrieval (2020)

In 2020, Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih (then at Facebook AI Research, the University of Washington, and Princeton University) released [dense passage retrieval](/wiki/dense_passage_retrieval) in the paper *Dense Passage Retrieval for Open-Domain Question Answering* at EMNLP 2020.[5]

DPR replaced the traditional BM25 retriever in the open-domain QA pipeline with a two-tower BERT model: one BERT encoder produced a 768-dimensional embedding for each Wikipedia passage and a second BERT encoder did the same for each question. The model was trained with a contrastive loss using one positive passage per question, the positive passages of other questions in the same batch as negatives, and one or two BM25 hard negatives.[5]

DPR was evaluated on five open-domain QA benchmarks (Natural Questions, TriviaQA, WebQuestions, CuratedTREC, and SQuAD). It outperformed BM25 by 9 to 19 absolute percentage points in top-20 passage retrieval accuracy on most of them, with SQuAD the notable exception where BM25 remained stronger.[5] The architecture, contrastive training recipe, and BM25 hard negative trick became the template for almost every modern dense retriever, including E5, GTE, BGE, Contriever, and the OpenAI text-embedding family.

### CLIP and multimodal two-tower models (2021)

OpenAI's CLIP, released in 2021, generalized the two-tower paradigm to a different pair of modalities by training one tower over images and another over text captions on four hundred million image-text pairs from the web with an InfoNCE objective. The resulting shared image-text embedding space supports zero-shot image classification and text-to-image search.[8] Models such as ALIGN, OpenCLIP, SigLIP, and EVA-CLIP have continued the same paradigm at larger scale.

## How is a two-tower model trained?

The two-tower model is almost always trained with a **contrastive learning** objective. The model is shown a positive query-item pair and a set of negative items, and is asked to assign a higher score to the positive pair than to any of the negatives.

### Loss functions

The most common training objective is the **in-batch sampled softmax**, also called the InfoNCE loss. For a mini-batch of N positive query-item pairs, the model treats the N items in the batch as a candidate set for each query. The loss for query q with positive item i is:

L(q, i) = -log(exp(s(q, i) / τ) / Σ exp(s(q, i') / τ))

where the sum runs over all N items i' in the batch (including the positive i) and τ is an optional temperature. This is equivalent to a cross-entropy over an N-way classification problem. In-batch negatives add almost no compute, because the item embeddings are already produced for the batch's own positives, and they let training benefit from large batch sizes (often 1,024 to 16,384). Other common objectives include the pairwise BPR loss, the triplet margin loss popularized by FaceNet and Sentence-BERT, hinge loss, and explicit noise contrastive estimation (NCE).
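
The loss above can be written in a few lines of NumPy. This is a forward-pass sketch only (no gradients), with an assumed default temperature; row k of the score matrix holds query k against every item in the batch, and the diagonal holds the positives.

```python
import numpy as np

def in_batch_softmax_loss(q_emb, i_emb, temperature=0.05):
    """Mean in-batch sampled softmax (InfoNCE) loss for aligned query-item pairs."""
    logits = (q_emb @ i_emb.T) / temperature               # shape (N, N)
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))             # positives sit on the diagonal

# Toy check with orthonormal embeddings and temperature 1.
eye = np.eye(4)
loss_aligned = in_batch_softmax_loss(eye, eye, temperature=1.0)
loss_mismatched = in_batch_softmax_loss(eye, np.roll(eye, 1, axis=0), temperature=1.0)
```

When each query matches its own item the loss is log(1 + 3/e); when the items are shifted so that every positive pair has zero similarity it rises to log(e + 3).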

### Negative sampling

The choice of how to construct negatives is one of the most important and most studied aspects of two-tower training. The dominant strategies are:

- **In-batch negatives:** treat the positive items of other queries in the same mini-batch as negatives. Free and known to scale well with batch size; used in DPR, SBERT, CLIP, and almost every modern dense retriever.
- **Mixed Negative Sampling (MNS):** combine in-batch negatives with uniformly sampled negatives from the corpus, introduced by Yang et al. 2020 to correct selection bias.[4]
- **BM25 hard negatives:** use the top documents returned by a lexical retriever that do not contain the answer; introduced in DPR to force discrimination between passages with strong lexical overlap.[5]
- **Mined hard negatives:** use a previous version of the dense retriever to find difficult negatives, the basis of the ANCE training procedure and most modern retrievers (E5, BGE, Contriever).[12]
- **Cross-batch negatives:** maintain a FIFO queue of recent embeddings from previous batches as additional negatives.[13]
- **Cross-encoder distillation:** distill a cross-encoder's score distribution into the bi-encoder via the MarginMSE loss, behind models such as msmarco-MiniLM and TAS-B.
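
Of the strategies listed above, cross-batch negatives are perhaps the simplest to show in isolation. The sketch below keeps a bounded FIFO queue of item embeddings from recent batches and exposes them as extra negatives. The class name and queue length are hypothetical, and a real training loop would also need to decide how to treat embeddings that become stale as the encoder weights change.

```python
from collections import deque

import numpy as np

class NegativeQueue:
    """FIFO store of item embeddings from the most recent batches."""

    def __init__(self, max_batches):
        self.batches = deque(maxlen=max_batches)   # oldest batch is dropped first

    def push(self, item_emb):
        self.batches.append(item_emb.copy())

    def negatives(self, dim):
        """All queued embeddings stacked into one (num_negatives, dim) array."""
        if not self.batches:
            return np.empty((0, dim))
        return np.concatenate(list(self.batches), axis=0)

queue = NegativeQueue(max_batches=2)
for batch_id in range(3):
    # Each fake batch is filled with its own ID so eviction is easy to see.
    queue.push(np.full((4, 8), float(batch_id)))

extra_negatives = queue.negatives(dim=8)   # holds batches 1 and 2 only
```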

### Sampling bias correction

When in-batch negatives follow the empirical impression distribution, popular items dominate the negative pool. The **logQ correction** subtracts the log of each item's estimated in-batch sampling probability from its logit, which corrects the sampled softmax toward the full-corpus softmax. Yi et al. 2019 implemented this with a streaming frequency estimator.[3] The correction matters most when training on real-world traffic logs; without it, popular items are penalized too heavily as negatives and retrieval recall suffers.
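
A minimal sketch of the correction itself, assuming the per-item sampling probabilities have already been estimated: the log probability of each candidate item is subtracted from the corresponding column of the logit matrix before the softmax. The function name and the toy probabilities are illustrative.

```python
import numpy as np

def logq_corrected_logits(logits, item_probs):
    """Subtract log sampling probability from each item's column of logits."""
    return logits - np.log(item_probs)[None, :]

# Two queries scored against three in-batch items with equal raw scores.
raw_logits = np.zeros((2, 3))
item_probs = np.array([0.5, 0.25, 0.25])   # item 0 is the popular one

corrected = logq_corrected_logits(raw_logits, item_probs)
```

Since log probabilities are negative, every logit moves up, but the popular item moves up the least. It therefore carries less weight in the softmax denominator when it appears as a negative, which offsets how often it is sampled.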

### Multi-stage training

Large production systems often train in stages: self-supervised pretraining (such as masked language modeling), a contrastive warmup with in-batch negatives, hard negative fine-tuning, and a periodic refresh on fresh interactions.

## How is a two-tower model served at inference?

The practical value of the two-tower model lies in how it splits cleanly between offline and online phases.

### Offline indexing

Once trained, the item tower is applied once to every item in the corpus to produce a fixed-length embedding. These embeddings are written to an [approximate nearest neighbor](/wiki/k_nearest_neighbors) index, partitioned across many machines for production-scale corpora and rebuilt periodically when new content arrives or the model is retrained. Common index building blocks, which are often combined, include product quantization (PQ) to compress vectors, inverted file indexing (IVF) to cluster vectors and search only nearby clusters, Hierarchical Navigable Small World graphs (HNSW) for graph traversal with roughly logarithmic search cost,[11] and the anisotropic vector quantization used in ScaNN, which penalizes quantization error parallel to the original vector more heavily than orthogonal error because the parallel component has the larger effect on inner products.[10]
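
The inverted-file idea can be sketched in plain NumPy. This toy version assigns each item to its nearest centroid and, at query time, scans only the `nprobe` closest lists. It is a simplified illustration rather than how FAISS or ScaNN are implemented: the centroids are randomly chosen items instead of k-means centers, and the vectors are not compressed.

```python
import numpy as np

rng = np.random.default_rng(3)

def build_ivf(item_emb, num_clusters):
    """Assign every item to the centroid with the highest inner product."""
    centroids = item_emb[rng.choice(len(item_emb), num_clusters, replace=False)]
    assignment = np.argmax(item_emb @ centroids.T, axis=1)
    lists = {c: np.flatnonzero(assignment == c) for c in range(num_clusters)}
    return centroids, lists

def search_ivf(q, item_emb, centroids, lists, k, nprobe):
    """Scan only the nprobe inverted lists whose centroids best match the query."""
    probe = np.argsort(-(centroids @ q))[:nprobe]
    candidates = np.concatenate([lists[c] for c in probe])
    scores = item_emb[candidates] @ q
    return candidates[np.argsort(-scores)[:k]]

items = rng.normal(size=(2000, 32))
items /= np.linalg.norm(items, axis=1, keepdims=True)
query = rng.normal(size=32)
query /= np.linalg.norm(query)

centroids, lists = build_ivf(items, num_clusters=20)
approx_top = search_ivf(query, items, centroids, lists, k=10, nprobe=3)
exhaustive_top = search_ivf(query, items, centroids, lists, k=10, nprobe=20)
exact_top = np.argsort(-(items @ query))[:10]
```

Probing every list reproduces exact search, while a small `nprobe` scans only a fraction of the corpus at the cost of possibly missing some true neighbors.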

### Online retrieval

At query time only the query tower runs, producing a single query vector. The vector is sent to the ANN index, which returns the top-K most similar items in single-digit milliseconds even when K is in the hundreds and the corpus is in the hundreds of millions. These items are then passed to a second-stage ranker (typically a cross-encoder, a deep ranking network with cross-features, or a tree ensemble) for fine scoring.

### ANN libraries

The table below summarizes widely used ANN libraries in two-tower deployments.

| Library | Maintainer | Year | Index families | Strengths |
| --- | --- | --- | --- | --- |
| FAISS | Meta AI Research | 2017 | IVF, PQ, HNSW, IVF-PQ, OPQ | Mature CPU/GPU support, widely used in research |
| ScaNN | Google Research | 2020 | Anisotropic VQ, asymmetric hashing | State-of-the-art tradeoff for inner-product search |
| HNSWlib | Yury Malkov | 2018 | HNSW | Lightweight C++ header-only, fast in-memory |
| Annoy | Spotify | 2015 | Random projection trees | Simple file format, used historically at Spotify |
| Vespa | Vespa.ai (Yahoo) | 2017 | HNSW, brute force | Search engine with hybrid lexical-dense retrieval |
| Milvus | Zilliz | 2019 | IVF, HNSW, IVF-PQ, DiskANN | Distributed vector database |
| Pinecone | Pinecone | 2021 | IVF and graph hybrids | Managed cloud vector search |
| Qdrant | Qdrant | 2021 | HNSW with payload filters | Open-source vector DB in Rust |
| Weaviate | Weaviate | 2019 | HNSW with hybrid BM25 | Vector DB with built-in object schema |

## How does the two-tower model differ from a cross-encoder?

The key architectural choice in modern neural retrieval is the bi-encoder versus cross-encoder tradeoff. A **cross-encoder** concatenates the query and a candidate into one input sequence, runs them through a transformer that attends across both, and outputs a relevance score. This enables much richer interaction modeling than the two-tower model, which only interacts via a final dot product, but it requires N forward passes per query for N candidates, which is impractical at retrieval scale.

| Aspect | Two-tower (bi-encoder) | Cross-encoder |
| --- | --- | --- |
| Encoding | Query and item encoded independently | Query and item encoded jointly |
| Item embeddings | Precomputed and stored in ANN index | Cannot be precomputed |
| Scoring per query | Single forward pass + ANN lookup | One forward pass per candidate |
| Latency at corpus size N | O(1) plus sublinear ANN search | O(N) transformer forward passes |
| Accuracy | Strong but not optimal | Highest accuracy of mainstream architectures |
| Use case | Retrieval, candidate generation, semantic search at scale | Reranking small candidate set, classification, pairwise comparison |
| Typical model | DSSM, SBERT, DPR, E5, BGE | monoBERT, ms-marco-MiniLM-L-6-v2, Cohere Rerank |
| Training data efficiency | Often needs hard negative mining | Usually trains well with random negatives |
| Interaction modeling | Only dot product at the end | Full attention across query and document |

The near-universal solution in production search and QA is a **retrieve-then-rerank** pipeline. A two-tower retriever fetches the top one hundred to one thousand candidates, then a cross-encoder reranker rescores only those and returns the final top ten or twenty. This brings the candidate set from billions to hundreds in milliseconds, then spends the cross-encoder budget on a small set where its cost is tolerable, capturing most of the cross-encoder's accuracy at near bi-encoder latency.
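
The two stages can be sketched as follows. The first function is an exact dot-product search standing in for the ANN lookup, and the second applies an arbitrary pairwise scorer to the shortlisted candidates only. The `toy_pair_scorer` function is a hypothetical placeholder for a cross-encoder; it is not a real model and exists only so that the example runs.

```python
import numpy as np

def retrieve(q_emb, item_emb, k):
    """Stage 1: top-k items by dot product (exact search in place of ANN)."""
    scores = item_emb @ q_emb
    top = np.argpartition(-scores, k - 1)[:k]     # unordered top-k
    return top[np.argsort(-scores[top])]          # ordered best first

def rerank(query, candidate_ids, pair_scorer, top_n):
    """Stage 2: rescore only the shortlisted candidates with an expensive scorer."""
    scored = [(pair_scorer(query, c), int(c)) for c in candidate_ids]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:top_n]]

def toy_pair_scorer(query, item_id):
    """Hypothetical stand-in for a cross-encoder: prefers item IDs near 7."""
    return -abs(item_id - 7)

rng = np.random.default_rng(4)
items = rng.normal(size=(1000, 16))
items /= np.linalg.norm(items, axis=1, keepdims=True)
query_vec = items[7] + 0.1 * rng.normal(size=16)   # a query close to item 7

candidates = retrieve(query_vec, items, k=100)
final = rerank("example query", candidates, toy_pair_scorer, top_n=10)
```

The expensive scorer runs 100 times instead of 1,000 here; at production scale the same structure reduces billions of potential scorings to a few hundred.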

## Modern extensions

The two-tower paradigm has been extended in many directions to push past its inherent limitations.

### Late interaction (ColBERT)

ColBERT, introduced by Omar Khattab and Matei Zaharia in 2020, sits between the two extremes. Each query and document is encoded independently into a bag of token-level embeddings rather than a single pooled vector. At query time, for each query token the maximum similarity over all document tokens is computed (the MaxSim operator), and the per-token maxima are summed into a final score.[6] Because document bags are precomputed, ColBERT keeps the bi-encoder's offline indexing benefits while capturing fine-grained interactions, at the cost of much larger index size. ColBERTv2 added denoised supervision and residual compression to shrink the index toward single-vector retriever sizes.
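
The MaxSim operator described above reduces to a matrix product followed by a row-wise maximum and a sum. The sketch below uses tiny hand-written token embeddings so the result can be verified by hand; real ColBERT token vectors come from a BERT encoder.

```python
import numpy as np

def maxsim_score(query_tokens, doc_tokens):
    """Sum over query tokens of the best similarity to any document token."""
    sim = query_tokens @ doc_tokens.T        # shape (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

query_tokens = np.array([[1.0, 0.0],
                         [0.0, 1.0]])
doc_tokens = np.array([[1.0, 0.0],
                       [0.6, 0.8]])

score = maxsim_score(query_tokens, doc_tokens)
```

In this example the first query token matches the first document token exactly (similarity 1.0) and the second query token is best matched by the second document token (similarity 0.8), giving a score of 1.8.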

### Matryoshka embeddings

*Matryoshka Representation Learning* (MRL), introduced by Aditya Kusupati and colleagues in 2022, modifies training so that the first d coordinates of every output vector are themselves a useful embedding for any d up to the full dimension. The model is trained with a multi-scale loss summing the contrastive loss over nested prefix sizes (for example 64, 128, 256, 512, 768).[7] This allows cheap retrieval at low dimension followed by reranking at full dimension (called *adaptive retrieval*) and safe truncation. OpenAI's text-embedding-3-small and text-embedding-3-large use Matryoshka training.
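
A sketch of truncation and adaptive retrieval, under the assumption that the embeddings come from a Matryoshka-trained model. The random vectors used here are not, so the example only demonstrates the mechanics, not retrieval quality. Function names and dimensions are illustrative.

```python
import numpy as np

def truncate(emb, d):
    """Keep the first d coordinates and renormalize to unit length."""
    prefix = emb[..., :d]
    return prefix / np.linalg.norm(prefix, axis=-1, keepdims=True)

def adaptive_retrieval(q, items, low_dim, shortlist, k):
    """Shortlist with a cheap low-dimensional prefix, then rerank at full dimension."""
    coarse = truncate(items, low_dim) @ truncate(q, low_dim)
    candidates = np.argsort(-coarse)[:shortlist]
    fine = items[candidates] @ q
    return candidates[np.argsort(-fine)[:k]]

rng = np.random.default_rng(5)
items = rng.normal(size=(500, 64))
items /= np.linalg.norm(items, axis=1, keepdims=True)
q = rng.normal(size=64)
q /= np.linalg.norm(q)

top_adaptive = adaptive_retrieval(q, items, low_dim=16, shortlist=50, k=5)
top_full_shortlist = adaptive_retrieval(q, items, low_dim=16, shortlist=500, k=5)
top_exact = np.argsort(-(items @ q))[:5]
```

The first pass touches only 16 of the 64 coordinates per item, and the full-dimension dot product is computed for just the shortlist. When the shortlist covers the whole corpus, the result matches exact full-dimension search.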

### Scaling laws and large embedding models

Following LLM scaling laws, researchers have studied how dense retriever quality changes with model size, data size, and embedding dimension. Models such as E5-Mistral-7B, GritLM, NV-Embed, and the SFR-Embedding series use multi-billion parameter LLM backbones as the encoder, either distilled or used directly. These large two-tower models top the MTEB benchmark and have closed much of the gap with cross-encoders, though training and serving costs grow correspondingly.

### Multi-interest, multi-vector, and graph extensions

Many follow-ups relax the single-vector bottleneck. Multi-interest user encoders (such as MIND from Alibaba) emit several user vectors instead of one, capturing different intents. Multi-vector item representations allocate several vectors per item. Graph-based two-tower models propagate information through the user-item interaction graph before the final dot product. Cross-tower interaction layers insert a small amount of attention between the towers' intermediate representations while keeping the final score a dot product, recovering some cross-encoder accuracy without breaking offline indexing.

## Notable production deployments

The two-tower architecture is the dominant retrieval design in large-scale recommender and search systems.

| Organization | Surface | Notes |
| --- | --- | --- |
| Microsoft Bing | Web search | DSSM (2013) and successors for query-document semantic matching |
| Google YouTube | Video recommendations | logQ correction (Yi 2019) and Mixed Negative Sampling (Yang 2020) |
| Google Search | Web and image search | Neural matching and various retrieval components |
| Pinterest | Pins and ads retrieval | Large-scale learned retrieval with HNSW serving |
| TikTok | Video candidate generation | Feeds a cascade ranking pipeline |
| Meta (Facebook, Instagram) | Feed, ads, search, friends | Embedding-based retrieval (EBR) |
| Netflix | Title recommendations and search | Embedding-based candidate generation |
| Amazon | Product search and recommendations | Two-tower, 3-tower, and 4-tower variants |
| Spotify | Music and podcast recommendations | ANN candidate generation (originally with Annoy) |
| Twitter (X) | Timeline candidate generation | EBR for the For You timeline |
| LinkedIn | Job search and people recommendations | Two-tower over professional graphs |
| Hugging Face SentenceTransformers | Open source library | SBERT and many bi-encoder checkpoints |

## What is the two-tower model used for?

The two-tower architecture is used across many applications:

- **Recommender systems.** The candidate generation stage of almost every modern recommender uses a two-tower model to narrow billions of items to a few hundred candidates that a heavier ranker then scores.
- **Semantic search.** Search engines use two-tower retrievers (often combined with BM25 in a hybrid configuration) to retrieve semantically related documents.
- **Retrieval Augmented Generation.** RAG systems use [dense passage retrieval](/wiki/dense_passage_retrieval) to pull relevant passages and feed them to a large language model as additional context. This is the primary use of the two-tower model in modern generative AI.
- **Question answering, e-commerce product search, code search, entity linking, and biomedical similarity search.** All of these map two related kinds of input into a shared space for fast similarity lookup.
- **Image-text and multimodal search.** CLIP and successors train a two-tower model with one tower per modality, enabling cross-modal retrieval and zero-shot classification.[8]

## Limitations

Despite its popularity, the two-tower model has well-known limitations.

- **Limited interaction.** Because the query and the item never interact except through a single inner product, the model cannot capture fine-grained conditional matching that a cross-encoder can. Conditional preferences (for example, whether a user prefers item A *given* recent interaction with item B) must be compressed into a single user vector.
- **Single vector bottleneck.** Users have many interests and documents cover many topics. Multi-vector approaches such as ColBERT and multi-interest networks address this at the cost of larger indices.[6]
- **Popularity and exposure bias.** Models trained on logged interactions over-recommend popular items because they appear more often as both positives and negatives. logQ correction, mixed negative sampling, and exposure-aware losses help but do not fully solve the problem.[3][4]
- **Long-tail and cold-start items.** Items with few interactions get noisy embeddings. Content-only towers help but lose collaborative signal.
- **Embedding space staleness.** Retraining requires careful synchronization between deployed query and item tower versions; Pinterest engineering has documented how version mismatches can drastically degrade retrieval quality.[14]

## Relationship to other architectures

The two-tower model sits in a broader family of representation-based retrieval architectures. Latent factor models such as matrix factorization are the simplest case: two embedding lookup tables trained so that the dot product approximates observed ratings. Cross-encoders sit at the opposite end with full attention and no precomputed item representations. Late interaction models like ColBERT live between the extremes.[6] Generative retrieval (DSI, SEAL) skips the embedding step entirely and trains a seq2seq model to emit document identifiers directly. Hybrid retrieval combines a sparse retriever (BM25 or SPLADE) with a dense two-tower retriever via score fusion.

## See also

- [Recommender system](/wiki/recommender_system)
- [Vector embeddings](/wiki/vector_embeddings)
- [Semantic search](/wiki/semantic_search)
- [Retrieval augmented generation](/wiki/retrieval_augmented_generation)
- [Dense passage retrieval](/wiki/dense_passage_retrieval)
- [FAISS](/wiki/faiss)
- [Embedding](/wiki/embeddings)
- [Neural network](/wiki/neural_network)
- [Transformer](/wiki/transformer)
- [k-nearest neighbors](/wiki/k_nearest_neighbors)

## References

1. Huang, P., He, X., Gao, J., Deng, L., Acero, A., & Heck, L. (2013). *Learning Deep Structured Semantic Models for Web Search using Clickthrough Data.* Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM '13), pp. 2333-2338.
2. Reimers, N., & Gurevych, I. (2019). *Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.* EMNLP-IJCNLP 2019, pp. 3982-3992.
3. Yi, X., Yang, J., Hong, L., Cheng, D., Heldt, L., Kumthekar, A., Zhao, Z., Wei, L., & Chi, E. (2019). *Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations.* RecSys 2019, pp. 269-277.
4. Yang, J., Yi, X., Cheng, D. Z., Hong, L., Li, Y., Wang, S. X., Xu, T., & Chi, E. H. (2020). *Mixed Negative Sampling for Learning Two-tower Neural Networks in Recommendations.* WWW 2020 Companion, pp. 441-447.
5. Karpukhin, V., Oğuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., & Yih, W. (2020). *Dense Passage Retrieval for Open-Domain Question Answering.* EMNLP 2020, pp. 6769-6781.
6. Khattab, O., & Zaharia, M. (2020). *ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.* SIGIR 2020, pp. 39-48.
7. Kusupati, A., et al. (2022). *Matryoshka Representation Learning.* NeurIPS 2022.
8. Radford, A., et al. (2021). *Learning Transferable Visual Models From Natural Language Supervision.* ICML 2021, pp. 8748-8763. (CLIP paper)
9. Johnson, J., Douze, M., & Jégou, H. (2019). *Billion-Scale Similarity Search with GPUs.* IEEE Transactions on Big Data, 7(3), pp. 535-547. (FAISS paper)
10. Guo, R., et al. (2020). *Accelerating Large-Scale Inference with Anisotropic Vector Quantization.* ICML 2020. (ScaNN paper)
11. Malkov, Y. A., & Yashunin, D. A. (2018). *Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs.* IEEE TPAMI, 42(4), pp. 824-836.
12. Xiong, L., et al. (2021). *Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval.* ICLR 2021. (ANCE paper)
13. Wang, J., et al. (2021). *Cross-Batch Negative Sampling for Training Two-Tower Recommenders.* SIGIR 2021, pp. 1632-1636.
14. Pinterest Engineering. (2022). *Establishing a Large Scale Learned Retrieval System at Pinterest.* Pinterest Engineering Blog.
15. Google Cloud. (2023). *Scaling deep retrieval with TensorFlow's two-tower architecture.* Google Cloud Blog.