See also: two-tower model, dual encoder, cross-encoder, contrastive learning, embedding, recommendation system, information retrieval
In modern deep learning, a tower is a self-contained sub-network inside a larger model that processes one specific input or feature group and produces an output (usually an embedding) that is later combined with the outputs of other towers. The word originally just meant a tall column of stacked layers, but in current practice it shows up most often in three places: the two-tower model used for retrieval and recommendation, multi-tower designs in industrial recommenders such as Wide & Deep and DLRM, and the original two-GPU "tower" split in AlexNet, one of the earliest practical demonstrations of model parallelism.
A tower is therefore less a specific architecture than an organizing pattern: pick the inputs that belong together, give them their own stack of layers, and combine the resulting representations later, often through a dot product, a small interaction MLP, or a concatenation. Decoupling inputs into separate towers gives engineers control over which parts of the model can be precomputed, which run online, and which can scale independently on different hardware.
Note on terminology: older sources sometimes use "tower" for the parallel branches inside an Inception module, and a few textbooks have stretched that to cover ResNet skip connections or DenseNet's dense connections. That framing is largely obsolete. The dominant modern meaning is the dual-encoder pattern, with the recommender and model-parallelism uses close behind. The Inception sense survives only as a historical footnote, covered briefly in a later section.
Imagine you and a friend each climb a separate tower. You both look at the world from the top of your tower, then you each shout a short summary down to the ground. A judge on the ground compares the two summaries and decides whether you were looking at the same thing. The towers do not talk to each other while you climb. They only meet at the bottom when the summaries are compared.
That is exactly how a two-tower model works. One tower reads a question, another tower reads a possible answer, each tower produces a small list of numbers, and the system checks how similar the two lists are. The towers can be very tall (lots of layers) or short (few layers), and they can even have different shapes, because they only need to agree on the size of the summary they shout out.
The two-tower model, also called a dual encoder or bi-encoder, is the most common modern use of the word "tower." Two neural networks are trained jointly so that the embedding produced by the query tower sits close in vector space to the embedding produced by the item tower when the two inputs are a relevant pair, and far apart when they are not. Similarity is usually measured by a dot product or cosine similarity.
The two towers do not share weights in the general case, although weight sharing (a Siamese setup) is common when both inputs are of the same type, for example two pieces of text. Each tower can be any encoder: a BERT stack for text, a ViT or ResNet for images, an MLP over user features, or a graph network over an item catalog. The only constraint is that both towers project into a shared embedding space of the same dimensionality, typically between 64 and 1024 dimensions.
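To make the structure concrete, here is a minimal sketch in PyTorch. The tiny MLP encoders, the input feature dimensions, and the 128-dimensional embedding size are illustrative assumptions; in practice each tower would be a BERT, ViT, or feature MLP as described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """One tower: any encoder followed by a projection into the shared space."""
    def __init__(self, input_dim: int, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so a dot product between towers is cosine similarity.
        return F.normalize(self.net(x), dim=-1)

query_tower = Tower(input_dim=300)   # e.g. pooled query-text features
item_tower = Tower(input_dim=512)    # e.g. item-metadata features

q = query_tower(torch.randn(32, 300))   # (batch, 128)
v = item_tower(torch.randn(32, 512))    # (batch, 128)

# Score every query in the batch against every item in the batch.
scores = q @ v.T                        # (32, 32)
```

The two towers can have completely different widths and depths; the only thing they share is the final 128-dimensional output space.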
At training time, batches of (query, positive item) pairs are fed in. The model is asked to identify the correct item for each query out of all items in the batch (in-batch negatives) and sometimes additional sampled hard negatives. The loss is contrastive: usually a softmax cross-entropy over the in-batch similarity matrix, equivalent to the InfoNCE objective. See contrastive learning for the full loss family.
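A sketch of that in-batch loss, reusing the `q` and `v` embeddings from the previous snippet; the temperature value is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q: torch.Tensor, v: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    # Row i scores query i against every item in the batch; the matching
    # (positive) item sits on the diagonal, all other items act as negatives.
    logits = (q @ v.T) / temperature
    targets = torch.arange(q.size(0), device=q.device)
    # Softmax cross-entropy over the in-batch similarity matrix (InfoNCE).
    return F.cross_entropy(logits, targets)
```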
The defining advantage of the two-tower design is that the item tower can be run offline. Embeddings for the entire item catalog (web pages, products, videos, passages, ad creatives) are computed once and stored. At serving time only the query tower runs, the query embedding is produced in milliseconds, and an approximate nearest neighbor (ANN) index such as FAISS, ScaNN, or HNSW returns the top-K items by inner product. This makes two-tower retrieval feasible at the scale of billions of items, which is why almost every large web search and recommendation system uses some variant.
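A minimal sketch of the offline/online split with FAISS, assuming L2-normalized embeddings so that inner product equals cosine similarity; a flat (exact) index stands in for the approximate indexes used in production, and the random vectors stand in for real tower outputs.

```python
import numpy as np
import faiss

embed_dim, catalog_size = 128, 100_000

# Offline: encode the whole catalog with the item tower, normalize, index.
item_embeddings = np.random.rand(catalog_size, embed_dim).astype("float32")
faiss.normalize_L2(item_embeddings)
index = faiss.IndexFlatIP(embed_dim)      # exact inner-product search
index.add(item_embeddings)

# Online: encode the incoming query with the query tower, look up top-K.
query_embedding = np.random.rand(1, embed_dim).astype("float32")
faiss.normalize_L2(query_embedding)
scores, item_ids = index.search(query_embedding, 100)
```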
The trade-off is that the towers cannot see each other during encoding. A query token cannot directly attend to a document token. All interaction happens at the very end, through a single dot product. That makes two-tower models excellent first-stage retrievers but weaker than cross-encoders at fine-grained ranking. The standard production recipe is two-tower retrieval followed by a heavier cross-encoder reranker on the top few hundred candidates.
| Model | Year | Authors | Towers | Use case |
|---|---|---|---|---|
| DSSM | 2013 | Huang et al., Microsoft | Query + document MLPs over word-hash features | Web search ranking |
| Sentence-BERT | 2019 | Reimers & Gurevych | Two BERT encoders (Siamese, weight-shared) | Sentence similarity, semantic search |
| YouTube two-tower retriever | 2019 | Yi et al., Google | User tower + video tower with sampling-bias correction | Candidate generation for YouTube recommendations |
| DPR | 2020 | Karpukhin et al., Facebook AI | Question BERT + passage BERT, separate weights | Open-domain question answering |
| ANCE | 2020 | Xiong et al., Microsoft | BERT-Siamese with asynchronously refreshed hard negatives | Dense passage retrieval |
| CLIP | 2021 | Radford et al., OpenAI | Image encoder (ViT or ResNet) + text Transformer | Multimodal retrieval, zero-shot classification |
DSSM is generally credited as the first deep two-tower model. It mapped queries and documents into a 128-dimensional space using a word-hashing input layer and three fully connected layers per side, trained on Bing click-through data. Sentence-BERT brought the same idea to the BERT era and famously cut sentence-pair similarity search on a 10,000-sentence corpus from about 65 hours with vanilla BERT to about 5 seconds. DPR did the same for open-domain QA, beating BM25 by 9 to 19 absolute points in top-20 retrieval accuracy across open-domain QA benchmarks such as Natural Questions. CLIP took the pattern multimodal: trained on 400 million image-text pairs, it produced zero-shot image classifiers competitive with fully supervised models on dozens of benchmarks.
ColBERT (Khattab & Zaharia, 2020) sits between two-tower retrieval and cross-encoder reranking. Each tower still encodes its input independently, but instead of producing a single vector, ColBERT keeps one embedding per token. Similarity is computed by summing, for every query token, the maximum dot product with any document token (the MaxSim operator). The interaction is therefore "late" but multi-vector, which preserves much of the cross-encoder's expressiveness while keeping documents precomputable. ColBERT runs about two orders of magnitude faster than a BERT cross-encoder per query.
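A sketch of the MaxSim operator, assuming the two towers have already produced per-token embeddings and normalized them; the token counts and embedding size are illustrative.

```python
import torch

def maxsim_score(query_tokens: torch.Tensor,   # (num_query_tokens, dim)
                 doc_tokens: torch.Tensor      # (num_doc_tokens, dim)
                 ) -> torch.Tensor:
    # Token-level similarity matrix between the two towers' outputs.
    sim = query_tokens @ doc_tokens.T
    # For each query token, keep its best-matching document token, then sum.
    return sim.max(dim=1).values.sum()

score = maxsim_score(torch.randn(12, 128), torch.randn(300, 128))
```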
Industrial recommendation systems often go beyond two towers. Different feature groups (user history, user demographics, item attributes, context) are routed through their own tower, and the towers' outputs are then combined for the final prediction. The classic examples are:
| Model | Year | Authors | Tower structure | Combination |
|---|---|---|---|---|
| Wide & Deep | 2016 | Cheng et al., Google | Wide linear model + deep MLP tower over embeddings | Joint training, sigmoid on summed logits |
| Deep & Cross (DCN) | 2017 | Wang et al., Google | Cross-network tower + deep MLP tower | Concatenation, then prediction layer |
| DLRM | 2019 | Naumov et al., Facebook | Dense feature MLP tower + many sparse-feature embedding towers | Pairwise dot products of all tower outputs, fed to a top MLP |
| Two-tower retrieval (YouTube) | 2019 | Yi et al., Google | User tower + item tower | Dot product, ANN search at serving time |
Wide & Deep is the canonical case. The wide component memorizes specific feature interactions (e.g., "users who installed app A also install app B"), while the deep tower learns to generalize across embeddings. Google reported a 3.9% relative gain in online app acquisitions over a wide-only control on Google Play, which serves over a billion active users.
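A minimal sketch of that combination, with illustrative feature dimensions: the wide and deep logits are summed and passed through a single sigmoid, matching the joint-training recipe in the table above.

```python
import torch
import torch.nn as nn

class WideAndDeep(nn.Module):
    def __init__(self, wide_dim: int, deep_dim: int):
        super().__init__()
        self.wide = nn.Linear(wide_dim, 1)       # memorizes cross-feature patterns
        self.deep = nn.Sequential(               # generalizes across embeddings
            nn.Linear(deep_dim, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, wide_x: torch.Tensor, deep_x: torch.Tensor) -> torch.Tensor:
        # Joint training: one sigmoid over the summed wide and deep logits.
        return torch.sigmoid(self.wide(wide_x) + self.deep(deep_x))

model = WideAndDeep(wide_dim=1000, deep_dim=128)
p_install = model(torch.randn(4, 1000), torch.randn(4, 128))
```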
DLRM, open-sourced by Facebook in 2019, is multi-tower in a different sense: every categorical feature gets its own embedding table acting as a small tower, and a separate dense-feature MLP processes the continuous inputs. The dot products of all pairs of these vectors are concatenated with the dense-feature output and passed to a top MLP. Each embedding table is sharded across devices through model parallelism while the MLPs run with data parallelism, an arrangement specifically chosen to handle Facebook's massive embedding tables without overflowing GPU memory.
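A sketch of that interaction step, with illustrative shapes; the real model uses far larger embedding tables and a deeper top MLP.

```python
import torch
import torch.nn as nn

batch, dim, num_sparse = 4, 16, 8
dense_out = torch.randn(batch, dim)                                # dense-feature MLP tower
sparse_out = [torch.randn(batch, dim) for _ in range(num_sparse)]  # embedding-table towers

# Stack all tower outputs and take every pairwise dot product.
towers = torch.stack([dense_out] + sparse_out, dim=1)        # (batch, 9, dim)
pairwise = torch.bmm(towers, towers.transpose(1, 2))         # (batch, 9, 9)

# Keep each unordered pair once (strictly lower triangle, no self-products).
i, j = torch.tril_indices(towers.size(1), towers.size(1), offset=-1)
interactions = pairwise[:, i, j]                              # (batch, 36)

# Concatenate with the dense output and feed the top MLP.
top_mlp = nn.Sequential(nn.Linear(dim + interactions.size(1), 64),
                        nn.ReLU(), nn.Linear(64, 1))
p_click = torch.sigmoid(top_mlp(torch.cat([dense_out, interactions], dim=1)))
```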
The pattern is the same across all these systems: separate towers for separate kinds of information, combined cheaply at the top. This both improves modeling quality (different feature types want different inductive biases) and keeps the computation tractable at production scale.
The oldest use of "tower" in deep learning predates retrieval entirely. The original AlexNet (Krizhevsky, Sutskever & Hinton, 2012) was trained on two NVIDIA GTX 580 GPUs, each with only 3 GB of memory. The full network did not fit on one card, so the authors split the convolutional kernels in half and ran them as two parallel "towers," with cross-tower connections only at certain layers (the input to layer 3 mixed both towers, but layer 4 only saw the same tower's layer 3 output). This was one of the first practical demonstrations of model parallelism, and the AlexNet paper explicitly uses the word "tower" to describe each GPU-resident half of the network. The two-GPU split improved top-5 error by about 1.2 points compared to a half-sized one-GPU baseline.
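In a modern framework the same connectivity pattern can be expressed with grouped convolutions. The sketch below uses PyTorch's `groups` argument, with channel counts loosely following AlexNet's conv2 and conv3 and the intervening pooling omitted.

```python
import torch
import torch.nn as nn

# Within-tower layer: with groups=2, each half of the output channels sees
# only the matching half of the input channels, as if it lived on one GPU.
within_tower = nn.Conv2d(96, 256, kernel_size=5, padding=2, groups=2)

# Cross-tower layer: every output channel sees all input channels,
# mirroring the mixing that happens at AlexNet's layer 3.
cross_tower = nn.Conv2d(256, 384, kernel_size=3, padding=1)

x = torch.randn(1, 96, 27, 27)
y = cross_tower(within_tower(x))   # (1, 384, 27, 27)
```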
Later frameworks for distributed training adopted the same term. In TensorFlow 1.x, the standard pattern for synchronous data parallelism was a loop that built one "tower" per GPU and averaged the gradients across them; the helper was literally named `replicate_model_fn`. The terminology has faded with newer APIs such as `tf.distribute.Strategy` and PyTorch DDP, but it still shows up in older codebases and in NVIDIA reference workloads.
In the GoogLeNet paper (Szegedy et al., 2015), the parallel 1x1, 3x3, and 5x5 convolution branches inside an Inception module were occasionally called towers, mostly in figure captions. The label never caught on; ResNet's residual blocks and DenseNet's dense connections are not normally called towers in any modern paper or textbook. As a rough rule, if you see "tower" in a paper from before about 2017 it usually means either an Inception-style branch or an AlexNet-style GPU split. In anything from 2019 onward it almost always means a sub-network in a dual-encoder or multi-tower recommender.
Because the two-tower idea is so closely tied to retrieval, it is worth contrasting it directly with the cross-encoder alternative.
| Property | Two-tower (bi-encoder) | Cross-encoder |
|---|---|---|
| Inputs encoded together? | No, each input goes through its own tower | Yes, the pair is concatenated and fed to one Transformer |
| Item embeddings precomputable? | Yes, fully offline | No, every (query, item) pair must run end to end |
| Latency at scale | Milliseconds with an ANN index | Seconds to minutes for a large corpus |
| Quality on fine-grained ranking | Lower; no token-level interaction | Higher; full attention between query and item |
| Typical role | First-stage retriever over millions to billions of items | Reranker over the top few hundred candidates |
| Training data needed | Tens of thousands of labeled or click-derived pairs | Same, often the same dataset reused |
The cross-encoder is more accurate per pair, but every (query, item) pair requires a full forward pass, so its cost grows with the product of query and candidate counts rather than their sum. Production stacks therefore almost always pair them: the bi-encoder narrows the candidate pool, the cross-encoder picks the winners.
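A sketch of that two-stage recipe using the sentence-transformers library; the model names are illustrative and can be swapped for any bi-encoder/cross-encoder pair, and the two-document corpus stands in for a real catalog.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

corpus = ["Doc about towers in retrieval.", "Doc about something else."]
query = "what is a two-tower model"

# Stage 1: bi-encoder retrieval over the (precomputable) corpus embeddings.
bi_encoder = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=100)[0]

# Stage 2: cross-encoder reranking of the retrieved candidates only.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, corpus[hit["corpus_id"]]) for hit in hits]
rerank_scores = cross_encoder.predict(pairs)
ranked = sorted(zip(hits, rerank_scores), key=lambda t: t[1], reverse=True)
```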
Two-tower and multi-tower architectures show up wherever a query has to be compared against a very large catalog in milliseconds. Common settings include web search candidate generation (DSSM at Bing, dual encoders at Google), open-domain question answering (DPR, ANCE, RocketQA), semantic search with the sentence-transformers family, recommendation candidate generation at YouTube and Pinterest, multimodal retrieval and zero-shot classification (CLIP, ALIGN, SigLIP), product matching in e-commerce catalogs, code search using CodeBERT-style dual encoders, and near-duplicate detection in content moderation pipelines.
The core limitation is the same one that makes two-tower models so fast: the towers do not see each other. A query token cannot attend to a document token before pooling, which means subtle interactions (negation, numeric matching, exact entity overlap) are hard to capture. Empirically, single-vector dual encoders also struggle with very long documents and with queries that require multi-hop reasoning.
A second issue is bias from in-batch negatives. When training data follows a power-law distribution, popular items show up as negatives for almost every query, so they are penalized far more often than their true relevance warrants and their embeddings end up distorted. Yi et al.'s sampling-bias (log-Q) correction at YouTube and the hard-negative mining in ANCE were both responses to this problem.
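A sketch of the log-Q correction, extending the earlier in-batch loss; the per-item sampling probabilities are assumed to come from a separate frequency estimator.

```python
import torch
import torch.nn.functional as F

def corrected_in_batch_loss(q: torch.Tensor, v: torch.Tensor,
                            item_sampling_prob: torch.Tensor,
                            temperature: float = 0.05) -> torch.Tensor:
    logits = (q @ v.T) / temperature
    # Subtract log p(item j appears in the batch) from every column so
    # popular items are not over-penalized as accidental negatives.
    logits = logits - torch.log(item_sampling_prob).unsqueeze(0)
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)
```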
A third issue is embedding drift over time. Item embeddings are precomputed; if the item tower is retrained without rebuilding the index, queries and items end up in slightly different spaces. Production systems usually re-encode the catalog on a daily or weekly cadence and version their embedding spaces so that ANN indexes can be swapped atomically.
None of these are dealbreakers. They are the reason every large retrieval stack is multi-stage rather than a single tower pair.