See also: two-tower model, dual encoder, cross-encoder, contrastive learning, embedding, recommendation system, information retrieval
In modern deep learning, a tower is a self-contained sub-network inside a larger model that processes one specific input or feature group and produces an output (usually an embedding) that is later combined with the outputs of other towers. The word originally just meant a tall column of stacked layers, but in current practice it shows up most often in three places: the two-tower model used for retrieval and recommendation, multi-tower designs in industrial recommenders such as Wide & Deep and DLRM, and the original two-GPU "tower" split in AlexNet, one of the earliest practical demonstrations of model parallelism.
A tower is therefore less a specific architecture than an organizing pattern: pick the inputs that belong together, give them their own stack of layers, and combine the resulting representations later, often through a dot product, a small interaction MLP, or a concatenation. Decoupling inputs into separate towers gives engineers control over which parts of the model can be precomputed, which run online, and which can scale independently on different hardware.
Note on terminology: older sources sometimes use "tower" for the parallel branches inside an Inception module, and a few textbooks have stretched that to cover ResNet skip connections or DenseNet's dense connections. That framing is largely obsolete. The dominant modern meaning is the dual-encoder pattern, with the recommender and model-parallelism uses close behind. The Inception sense survives only as a historical footnote, covered briefly in a later section.
Imagine you and a friend each climb a separate tower. You both look at the world from the top of your tower, then you each shout a short summary down to the ground. A judge on the ground compares the two summaries and decides whether you were looking at the same thing. The towers do not talk to each other while you climb. They only meet at the bottom when the summaries are compared.
That is exactly how a two-tower model works. One tower reads a question, another tower reads a possible answer, each tower produces a small list of numbers, and the system checks how similar the two lists are. The towers can be very tall (lots of layers) or short (few layers), and they can even have different shapes, because they only need to agree on the size of the summary they shout out.
The two-tower model, also called a dual encoder or bi-encoder, is the most common modern use of the word "tower." Two neural networks are trained jointly so that the embedding produced by the query tower sits close in vector space to the embedding produced by the item tower when the two inputs are a relevant pair, and far apart when they are not. Similarity is usually measured by a dot product or cosine similarity.
The two towers do not share weights in the general case, although weight sharing (a Siamese setup) is common when both inputs are of the same type, for example two pieces of text. Each tower can be any encoder: a BERT stack for text, a ViT or ResNet for images, an MLP over user features, or a graph network over an item catalog. The only constraint is that both towers project into a shared embedding space of the same dimensionality, typically between 64 and 1024 dimensions.
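To make the structure concrete, here is a minimal sketch in PyTorch. The tiny MLP encoders, the input feature dimensions, and the 128-dimensional embedding size are illustrative assumptions; in practice each tower would be a BERT, ViT, or feature MLP as described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """One tower: any encoder followed by a projection into the shared space."""
    def __init__(self, input_dim: int, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so a dot product between towers is cosine similarity.
        return F.normalize(self.net(x), dim=-1)

query_tower = Tower(input_dim=300)   # e.g. pooled query-text features
item_tower = Tower(input_dim=512)    # e.g. item-metadata features

q = query_tower(torch.randn(32, 300))   # (batch, 128)
v = item_tower(torch.randn(32, 512))    # (batch, 128)

# Score every query in the batch against every item in the batch.
scores = q @ v.T                        # (32, 32)
```

The two towers can have completely different widths and depths; the only thing they share is the final 128-dimensional output space.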
At training time, batches of (query, positive item) pairs are fed in. The model is asked to identify the correct item for each query out of all items in the batch (in-batch negatives) and sometimes additional sampled hard negatives. The loss is contrastive: usually a softmax cross-entropy over the in-batch similarity matrix, equivalent to the InfoNCE objective. See contrastive learning for the full loss family.
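A sketch of that in-batch loss, reusing the `q` and `v` embeddings from the previous snippet; the temperature value is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q: torch.Tensor, v: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    # Row i scores query i against every item in the batch; the matching
    # (positive) item sits on the diagonal, all other items act as negatives.
    logits = (q @ v.T) / temperature
    targets = torch.arange(q.size(0), device=q.device)
    # Softmax cross-entropy over the in-batch similarity matrix (InfoNCE).
    return F.cross_entropy(logits, targets)
```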
The defining advantage of the two-tower design is that the item tower can be run offline. Embeddings for the entire item catalog (web pages, products, videos, passages, ad creatives) are computed once and stored. At serving time only the query tower runs, the query embedding is produced in milliseconds, and an approximate nearest neighbor (ANN) index such as FAISS, ScaNN, or HNSW returns the top-K items by inner product. This makes two-tower retrieval feasible at the scale of billions of items, which is why almost every large web search and recommendation system uses some variant.
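A minimal sketch of the offline/online split with FAISS, assuming L2-normalized embeddings so that inner product equals cosine similarity; a flat (exact) index stands in for the approximate indexes used in production, and the random vectors stand in for real tower outputs.

```python
import numpy as np
import faiss

embed_dim, catalog_size = 128, 100_000

# Offline: encode the whole catalog with the item tower, normalize, index.
item_embeddings = np.random.rand(catalog_size, embed_dim).astype("float32")
faiss.normalize_L2(item_embeddings)
index = faiss.IndexFlatIP(embed_dim)      # exact inner-product search
index.add(item_embeddings)

# Online: encode the incoming query with the query tower, look up top-K.
query_embedding = np.random.rand(1, embed_dim).astype("float32")
faiss.normalize_L2(query_embedding)
scores, item_ids = index.search(query_embedding, 100)
```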
The trade-off is that the towers cannot see each other during encoding. A query token cannot directly attend to a document token. All interaction happens at the very end, through a single dot product. That makes two-tower models excellent first-stage retrievers but weaker than cross-encoders at fine-grained ranking. The standard production recipe is two-tower retrieval followed by a heavier cross-encoder reranker on the top few hundred candidates.
| Model | Year | Authors | Towers | Use case |
|---|---|---|---|---|
| DSSM | 2013 | Huang et al., Microsoft | Query + document MLPs over word-hash features | Web search ranking |
| Sentence-BERT | 2019 | Reimers & Gurevych | Two BERT encoders (Siamese, weight-shared) | Sentence similarity, semantic search |
| YouTube two-tower retriever | 2019 | Yi et al., Google | User tower + video tower with sampling-bias correction | Candidate generation for YouTube recommendations |
| DPR | 2020 | Karpukhin et al., Facebook AI | Question BERT + passage BERT, separate weights | Open-domain question answering |
| ANCE | 2020 | Xiong et al., Microsoft | BERT-Siamese with asynchronously refreshed hard negatives | Dense passage retrieval |
| CLIP | 2021 | Radford et al., OpenAI | Image encoder (ViT or ResNet) + text Transformer | Multimodal retrieval, zero-shot classification |
DSSM is generally credited as the first deep two-tower model. It mapped queries and documents into a 128-dimensional space using a word-hashing input layer and three fully connected layers per side, trained on Bing click-through data. Sentence-BERT brought the same idea to the BERT era and famously cut sentence-pair similarity search on a 10,000-sentence corpus from about 65 hours with vanilla BERT to about 5 seconds. DPR did the same for open-domain QA, beating BM25 by 9 to 19 absolute points in top-20 retrieval accuracy across open-domain QA benchmarks such as Natural Questions. CLIP took the pattern multimodal: trained on 400 million image-text pairs, it produced zero-shot image classifiers competitive with fully supervised models on dozens of benchmarks.
ColBERT (Khattab & Zaharia, 2020) sits between two-tower retrieval and cross-encoder reranking. Each tower still encodes its input independently, but instead of producing a single vector, ColBERT keeps one embedding per token. Similarity is computed by summing, for every query token, the maximum dot product with any document token (the MaxSim operator). The interaction is therefore "late" but multi-vector, which preserves much of the cross-encoder's expressiveness while keeping documents precomputable. ColBERT runs about two orders of magnitude faster than a BERT cross-encoder per query.
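A sketch of the MaxSim operator, assuming the two towers have already produced per-token embeddings and normalized them; the token counts and embedding size are illustrative.

```python
import torch

def maxsim_score(query_tokens: torch.Tensor,   # (num_query_tokens, dim)
                 doc_tokens: torch.Tensor      # (num_doc_tokens, dim)
                 ) -> torch.Tensor:
    # Token-level similarity matrix between the two towers' outputs.
    sim = query_tokens @ doc_tokens.T
    # For each query token, keep its best-matching document token, then sum.
    return sim.max(dim=1).values.sum()

score = maxsim_score(torch.randn(12, 128), torch.randn(300, 128))
```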
Industrial recommendation systems often go beyond two towers. Different feature groups (user history, user demographics, item attributes, context) are routed through their own tower, and the towers' outputs are then combined for the final prediction. The classic examples are:
| Model | Year | Authors | Tower structure | Combination |
|---|---|---|---|---|
| Wide & Deep | 2016 | Cheng et al., Google | Wide linear model + deep MLP tower over embeddings | Joint training, sigmoid on summed logits |
| Deep & Cross (DCN) | 2017 | Wang et al., Google | Cross-network tower + deep MLP tower | Concatenation, then prediction layer |
| DLRM | 2019 | Naumov et al., Facebook | Dense feature MLP tower + many sparse-feature embedding towers | Pairwise dot products of all tower outputs, fed to a top MLP |
| Two-tower retrieval (YouTube) | 2019 | Yi et al., Google | User tower + item tower | Dot product, ANN search at serving time |
Wide & Deep is the canonical case. The wide component memorizes specific feature interactions (e.g., "users who installed app A also install app B"), while the deep tower learns to generalize across embeddings. Google reported a 3.9% relative gain in online app acquisitions over a wide-only control on Google Play, which serves over a billion active users.
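A minimal sketch of that combination, with illustrative feature dimensions: the wide and deep logits are summed and passed through a single sigmoid, matching the joint-training recipe in the table above.

```python
import torch
import torch.nn as nn

class WideAndDeep(nn.Module):
    def __init__(self, wide_dim: int, deep_dim: int):
        super().__init__()
        self.wide = nn.Linear(wide_dim, 1)       # memorizes cross-feature patterns
        self.deep = nn.Sequential(               # generalizes across embeddings
            nn.Linear(deep_dim, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, wide_x: torch.Tensor, deep_x: torch.Tensor) -> torch.Tensor:
        # Joint training: one sigmoid over the summed wide and deep logits.
        return torch.sigmoid(self.wide(wide_x) + self.deep(deep_x))

model = WideAndDeep(wide_dim=1000, deep_dim=128)
p_install = model(torch.randn(4, 1000), torch.randn(4, 128))
```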
DLRM, open-sourced by Facebook in 2019, is multi-tower in a different sense: every categorical feature gets its own embedding table acting as a small tower, and a separate dense-feature MLP processes the continuous inputs. The dot products of all pairs of these vectors are concatenated with the dense-feature output and passed to a top MLP. Each embedding table is sharded across devices through model parallelism while the MLPs run with data parallelism, an arrangement specifically chosen to handle Facebook's massive embedding tables without overflowing GPU memory.
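A sketch of that interaction step, with illustrative shapes; the real model uses far larger embedding tables and a deeper top MLP.

```python
import torch
import torch.nn as nn

batch, dim, num_sparse = 4, 16, 8
dense_out = torch.randn(batch, dim)                                # dense-feature MLP tower
sparse_out = [torch.randn(batch, dim) for _ in range(num_sparse)]  # embedding-table towers

# Stack all tower outputs and take every pairwise dot product.
towers = torch.stack([dense_out] + sparse_out, dim=1)        # (batch, 9, dim)
pairwise = torch.bmm(towers, towers.transpose(1, 2))         # (batch, 9, 9)

# Keep each unordered pair once (strictly lower triangle, no self-products).
i, j = torch.tril_indices(towers.size(1), towers.size(1), offset=-1)
interactions = pairwise[:, i, j]                              # (batch, 36)

# Concatenate with the dense output and feed the top MLP.
top_mlp = nn.Sequential(nn.Linear(dim + interactions.size(1), 64),
                        nn.ReLU(), nn.Linear(64, 1))
p_click = torch.sigmoid(top_mlp(torch.cat([dense_out, interactions], dim=1)))
```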
The pattern is the same across all these systems: separate towers for separate kinds of information, combined cheaply at the top. This both improves modeling quality (different feature types want different inductive biases) and keeps the computation tractable at production scale.
The oldest use of "tower" in deep learning predates retrieval entirely. The original AlexNet (Krizhevsky, Sutskever & Hinton, 2012) was trained on two NVIDIA GTX 580 GPUs, each with only 3 GB of memory. The full network did not fit on one card, so the authors split the convolutional kernels in half and ran them as two parallel "towers," with cross-tower connections only at certain layers (the input to layer 3 mixed both towers, but layer 4 only saw the same tower's layer 3 output). This was one of the first practical demonstrations of model parallelism, and the AlexNet paper explicitly uses the word "tower" to describe each GPU-resident half of the network. The two-GPU split improved top-5 error by about 1.2 points compared to a half-sized one-GPU baseline.
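In a modern framework the same connectivity pattern can be expressed with grouped convolutions. The sketch below uses PyTorch's `groups` argument, with channel counts loosely following AlexNet's conv2 and conv3 and the intervening pooling omitted.

```python
import torch
import torch.nn as nn

# Within-tower layer: with groups=2, each half of the output channels sees
# only the matching half of the input channels, as if it lived on one GPU.
within_tower = nn.Conv2d(96, 256, kernel_size=5, padding=2, groups=2)

# Cross-tower layer: every output channel sees all input channels,
# mirroring the mixing that happens at AlexNet's layer 3.
cross_tower = nn.Conv2d(256, 384, kernel_size=3, padding=1)

x = torch.randn(1, 96, 27, 27)
y = cross_tower(within_tower(x))   # (1, 384, 27, 27)
```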
Later frameworks for distributed training adopted the same term. In TensorFlow 1.x, the standard pattern for synchronous data parallelism was a loop that built one "tower" per GPU and averaged the gradients across them; the helper was literally named `replicate_model_fn`. The terminology has faded with newer APIs such as `tf.distribute.Strategy` and PyTorch DDP, but it still shows up in older codebases and in NVIDIA reference workloads.
In the GoogLeNet paper (Szegedy et al., 2015), the parallel 1x1, 3x3, and 5x5 convolution branches inside an Inception module were occasionally called towers, mostly in figure captions. The label never caught on; ResNet's residual blocks and DenseNet's dense connections are not normally called towers in any modern paper or textbook. As a rough rule, if you see "tower" in a paper from before about 2017 it usually means either an Inception-style branch or an AlexNet-style GPU split. In anything from 2019 onward it almost always means a sub-network in a dual-encoder or multi-tower recommender.
Because the two-tower idea is so closely tied to retrieval, it is worth contrasting it directly with the cross-encoder alternative.
| Property | Two-tower (bi-encoder) | Cross-encoder |
|---|---|---|
| Inputs encoded together? | No, each input goes through its own tower | Yes, the pair is concatenated and fed to one Transformer |
| Item embeddings precomputable? | Yes, fully offline | No, every (query, item) pair must run end to end |
| Latency at scale | Milliseconds with an ANN index | Seconds to minutes for a large corpus |
| Quality on fine-grained ranking | Lower; no token-level interaction | Higher; full attention between query and item |
| Typical role | First-stage retriever over millions to billions of items | Reranker over the top few hundred candidates |
| Training data needed | Tens of thousands of labeled or click-derived pairs | Same, often the same dataset reused |
The cross-encoder is more accurate per pair, but every (query, item) pair requires a full forward pass, so its cost grows with the product of query and candidate counts rather than their sum. Production stacks therefore almost always pair them: the bi-encoder narrows the candidate pool, the cross-encoder picks the winners.
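A sketch of that two-stage recipe using the sentence-transformers library; the model names are illustrative and can be swapped for any bi-encoder/cross-encoder pair, and the two-document corpus stands in for a real catalog.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

corpus = ["Doc about towers in retrieval.", "Doc about something else."]
query = "what is a two-tower model"

# Stage 1: bi-encoder retrieval over the (precomputable) corpus embeddings.
bi_encoder = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=100)[0]

# Stage 2: cross-encoder reranking of the retrieved candidates only.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, corpus[hit["corpus_id"]]) for hit in hits]
rerank_scores = cross_encoder.predict(pairs)
ranked = sorted(zip(hits, rerank_scores), key=lambda t: t[1], reverse=True)
```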
Two-tower and multi-tower architectures show up wherever a query has to be compared against a very large catalog in milliseconds. Common settings include web search candidate generation (DSSM at Bing, dual encoders at Google), open-domain question answering (DPR, ANCE, RocketQA), semantic search with the sentence-transformers family, recommendation candidate generation at YouTube and Pinterest, multimodal retrieval and zero-shot classification (CLIP, ALIGN, SigLIP), product matching in e-commerce catalogs, code search using CodeBERT-style dual encoders, and near-duplicate detection in content moderation pipelines.
The core limitation is the same one that makes two-tower models so fast: the towers do not see each other. A query token cannot attend to a document token before pooling, which means subtle interactions (negation, numeric matching, exact entity overlap) are hard to capture. Empirically, single-vector dual encoders also struggle with very long documents and with queries that require multi-hop reasoning.
A second issue is bias from in-batch negatives. When training data follows a power-law distribution, popular items show up as negatives for almost every query, so they are penalized far more often than their true relevance warrants and their embeddings end up distorted. Yi et al.'s sampling-bias (log-Q) correction at YouTube and the hard-negative mining in ANCE were both responses to this problem.
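A sketch of the log-Q correction, extending the earlier in-batch loss; the per-item sampling probabilities are assumed to come from a separate frequency estimator.

```python
import torch
import torch.nn.functional as F

def corrected_in_batch_loss(q: torch.Tensor, v: torch.Tensor,
                            item_sampling_prob: torch.Tensor,
                            temperature: float = 0.05) -> torch.Tensor:
    logits = (q @ v.T) / temperature
    # Subtract log p(item j appears in the batch) from every column so
    # popular items are not over-penalized as accidental negatives.
    logits = logits - torch.log(item_sampling_prob).unsqueeze(0)
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)
```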
A third issue is embedding drift over time. Item embeddings are precomputed; if the item tower is retrained without rebuilding the index, queries and items end up in slightly different spaces. Production systems usually re-encode the catalog on a daily or weekly cadence and version their embedding spaces so that ANN indexes can be swapped atomically.
None of these are dealbreakers. They are the reason every large retrieval stack is multi-stage rather than a single tower pair.