ColBERT

Updated Jun 23, 2026

ColBERT (Contextualized Late Interaction over BERT) is a neural information retrieval model that encodes queries and documents into matrices of token-level vectors and scores them with a late-interaction operator called MaxSim, summing the maximum similarity between each query token and any document token.^[1] It was introduced by Omar Khattab and Matei Zaharia at SIGIR 2020 and is the founding model of the "late interaction" family of retrievers, occupying a middle ground between single-vector dense retrievers, which compress a whole passage into one embedding, and BERT cross-encoders, which jointly encode the query and document at query time.^[1] The original paper reports that ColBERT matches BERT-based rerankers on MS MARCO while "executing two orders-of-magnitude faster and requiring four orders-of-magnitude fewer FLOPs per query."^[1] Subsequent releases, ColBERTv2 (NAACL 2022) and the PLAID engine (CIKM 2022), added residual compression, centroid-based indexing, and an optimized retrieval pipeline that brought ColBERT-style search into millisecond-scale production use; ColBERTv2 reaches 39.7% MRR@10 on the MS MARCO passage Dev set.^[2]^[3]^[16] By 2024 the family had become a standard recipe for late-interaction retrieval, with the official Stanford-FutureData implementation, the RAGatouille library by Benjamin Clavié, and integrations in vector engines such as Vespa and Qdrant.^[4]^[5]^[6]

Field	Value
Original authors	Omar Khattab, Matei Zaharia^[1]
First publication	SIGIR 2020 (arXiv 2004.12832, April 27, 2020)^[1]
Affiliation	Stanford University, FutureData group^[4]
Latest research version	ColBERTv2 (NAACL 2022) with the PLAID engine (CIKM 2022)^[2]^[3]
Reference implementation	github.com/stanford-futuredata/ColBERT (MIT-licensed)^[4]
Embedding dimension	128 per token^[7]
Scoring operator	MaxSim (sum of per-query-token max similarities)^[1]
ColBERTv2 MS MARCO Dev MRR@10	39.7%^[16]
Reported speedup vs BERT reranker	~100x faster, ~10,000x fewer FLOPs per query^[1]

What problem does ColBERT solve?

By 2019, BERT cross-encoders had become the de facto re-ranking models for passage retrieval, producing large quality gains over lexical baselines like BM25 on benchmarks such as MS MARCO and TREC.^[1] Cross-encoders concatenate the query and the candidate document and run a full transformer forward pass per pair, which delivers strong relevance signals but scales poorly: every query forces a fresh joint encoding for each candidate document, ruling out direct first-stage retrieval over millions of passages.^[1] In parallel, dense bi-encoders such as DPR mapped queries and passages into a single vector each and relied on cosine or inner-product similarity, enabling efficient nearest-neighbor search but losing the fine-grained token-level signal that made cross-encoders so accurate.^[1]^[8]

Omar Khattab, then a Stanford Ph.D. student advised by Matei Zaharia and Christopher Potts, designed ColBERT to bridge that gap.^[9] The abstract frames the contribution as a model that "independently encodes the query and the document using BERT and then employs a cheap yet powerful interaction step that models their fine-grained similarity," so that documents can be encoded offline while preserving cross-encoder-like accuracy.^[1] Khattab's collaborators on the project included researchers from Stanford NLP and Stanford's FutureData systems group led by Zaharia, who had previously created Apache Spark and co-founded Databricks.^[4]^[9] The original paper was uploaded to arXiv on April 27, 2020 and presented at the 43rd ACM SIGIR conference in July 2020.^[1] After his Ph.D., Khattab spent time at Databricks as a research scientist before joining MIT EECS as an assistant professor and member of CSAIL.^[9]

When was ColBERT released, and how has the family evolved?

Year	Release	Key contribution
2020	ColBERT (v1)^[1]	Late interaction architecture, MaxSim operator, end-to-end retrieval over a FAISS index
2021	PLAID prototype / engineering work^[4]	Faster indexing in the stanford-futuredata repository
2021/2022	ColBERTv2 (arXiv 2112.01488, NAACL 2022)^[2]	Residual compression, denoised cross-encoder distillation, LoTTE benchmark
2022	PLAID (arXiv 2205.09707, CIKM 2022)^[3]	Centroid-interaction pruning, 7x GPU / 45x CPU speedup over vanilla ColBERTv2
2023	RAGatouille library^[5]	Three-line Python API around ColBERTv2 by Benjamin Clavié
2023	JaColBERT (arXiv 2312.16144)^[10]	Monolingual Japanese ColBERT trained on far less data
2024	Vespa ColBERT embedder^[6]	First-class integration of late interaction in a vector engine, with binary token compression
2024	ColPali (arXiv 2407.01449)^[11]	Multi-vector retrieval over document page images
2024	JaColBERTv2.5 and answerai-colbert-small^[12]^[13]	Stronger Japanese retriever; 33M-parameter English model from Answer.AI

How does ColBERT work?

ColBERT's central idea is to keep the per-token contextualized embeddings produced by BERT and to let query and document embeddings interact only at query time, through a cheap operator.^[1] The system has three components: a query encoder, a document encoder, and the late-interaction scorer.

Query and document encoders

Both encoders share weights and reuse a BERT backbone (BERT-base in the original paper).^[1] Queries are prepended with a special [Q] marker token; if a query is shorter than a fixed length, it is padded with [mask] tokens that the encoder is still allowed to contextualize, an idea the paper calls "query augmentation."^[1] Document inputs are prepended with [D]. After BERT, each token vector is projected to a much lower dimension (128 by default) and L2-normalized.^[7] A document of n tokens therefore becomes an n x 128 matrix; a query of Nq tokens becomes an Nq x 128 matrix.

Query augmentation is more than a padding trick: the additional [mask] positions act as extra "soft" terms that BERT can fill in with contextually relevant content, which helps with very short keyword-style queries by effectively expanding them.^[1] On the document side, punctuation tokens are filtered out of the stored representation, both to reduce index size and to avoid wasting MaxSim alignments on uninformative tokens.^[1]
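
The token-level preprocessing described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the reference implementation: the helper names, the default query length of 32, and the pre-tokenized input lists are assumptions made for the example, and `[MASK]` is BERT's mask token (written [mask] in the text).

```python
import string

# Assumed constants for illustration; the text only says "a fixed length".
QUERY_MAXLEN = 32
Q_MARKER, D_MARKER, MASK = "[Q]", "[D]", "[MASK]"


def augment_query(tokens, maxlen=QUERY_MAXLEN):
    """Prepend the query marker, then pad with mask tokens up to maxlen."""
    marked = [Q_MARKER] + list(tokens)
    marked = marked[:maxlen]  # truncate overly long queries
    return marked + [MASK] * (maxlen - len(marked))


def prepare_document(tokens):
    """Prepend the document marker (documents are not mask-padded)."""
    return [D_MARKER] + list(tokens)


def positions_to_store(tokens):
    """Indices of tokens whose output vectors are kept in the index.

    The encoder still sees punctuation; only the stored vectors are filtered.
    """
    return [i for i, t in enumerate(tokens)
            if not all(ch in string.punctuation for ch in t)]


query = augment_query(["late", "interaction"], maxlen=8)
doc = prepare_document(["ColBERT", ",", "explained", "."])
print(query)
print(positions_to_store(doc))
```

In this sketch the query grows from three real tokens to eight positions, and the document keeps vectors only for its marker and its two word tokens.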

Because the encoders are decoupled, document representations can be computed entirely offline and stored in an index, which removes per-query encoding cost for the corpus side.^[1] At search time only the query has to be encoded.

How does ColBERT's late interaction (MaxSim) work?

The late-interaction score for a query Q and document D is defined as:

score(Q, D) = Σ_i max_j  Q_i · D_j

For each query token embedding Q_i, the operator finds the document token embedding D_j with the highest similarity (cosine similarity, since both sides are unit-normalized), then sums those per-token maxima across the query.^[1]^[7] This preserves a many-to-one alignment from query tokens to document tokens, similar in spirit to how a cross-encoder attends to relevant document spans, while reducing to a matrix product followed by a row-wise max and a sum, all of which parallelize well on GPUs.^[1]
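
A minimal pure-Python sketch of the MaxSim formula follows. The two-dimensional toy vectors and the helper names are assumptions for illustration; real ColBERT embeddings are 128-dimensional and scored with batched tensor operations.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs, doc_vecs):
    """Sum over query tokens of the best-matching document token similarity."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-dimensional "token embeddings" (hypothetical values).
Q = [l2_normalize(v) for v in [[1.0, 0.0], [0.0, 1.0]]]
D1 = [l2_normalize(v) for v in [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]]
D2 = [l2_normalize(v) for v in [[1.0, 1.0]]]

print(maxsim(Q, D1))  # each query token finds its own exact match
print(maxsim(Q, D2))  # both query tokens align to the same document token
```

The second document shows the many-to-one alignment: it has a single token, so every query token is matched against it, and the score is lower than for the document that covers both query tokens exactly.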

ColBERT thus sits between bi-encoders and cross-encoders. A bi-encoder collapses D to a single vector and computes a single dot product; ColBERT keeps the matrix and aggregates with MaxSim; a cross-encoder runs full self-attention over Q and D together.^[7] The paper reports that ColBERT matches BERT-base in MRR@10 on MS MARCO while running roughly two orders of magnitude faster and using about four orders of magnitude fewer FLOPs per query than the cross-encoder at re-ranking time.^[1]

End-to-end retrieval in ColBERT v1

ColBERT also supports end-to-end retrieval over a corpus, not just re-ranking. Token-level embeddings for all documents are flattened into one large FAISS index; for each query token the index returns approximate nearest document tokens, and the union of source documents becomes the candidate set, which is then re-scored exactly by MaxSim.^[1] This was a notable departure from typical neural rerankers, which depended on a lexical first stage. The trade-off was index size: late-interaction indexes are roughly an order of magnitude larger than single-vector indexes because every passage stores one small vector per token rather than a single embedding.^[14]
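
The two-step flow (token-level candidate generation, then exact re-scoring) can be sketched as below. To keep the example self-contained, an exhaustive scan stands in for the FAISS approximate nearest-neighbor index; the function names, the parameters `k_tokens` and `top_n`, and the toy vectors are all assumptions for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs, doc_vecs):
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def build_flat_index(docs):
    """Flatten {doc_id: [token vectors]} into parallel lists of vectors and owners."""
    vectors, owners = [], []
    for doc_id, token_vecs in docs.items():
        for vec in token_vecs:
            vectors.append(vec)
            owners.append(doc_id)
    return vectors, owners


def retrieve(query_vecs, docs, k_tokens=2, top_n=2):
    vectors, owners = build_flat_index(docs)
    candidates = set()
    for q in query_vecs:
        # Exhaustive scan here; a real system would query an ANN index instead.
        ranked = sorted(range(len(vectors)),
                        key=lambda i: dot(q, vectors[i]), reverse=True)
        candidates.update(owners[i] for i in ranked[:k_tokens])
    # Exact late-interaction re-scoring of the candidate documents only.
    scored = [(maxsim(query_vecs, docs[d]), d) for d in candidates]
    return [d for _, d in sorted(scored, reverse=True)[:top_n]]


docs = {
    "a": [[1.0, 0.0], [0.0, 1.0]],
    "b": [[0.6, 0.8]],
    "c": [[-1.0, 0.0]],
}
query = [[1.0, 0.0], [0.0, 1.0]]
print(retrieve(query, docs))
```

Document "c" never enters the candidate set because none of its tokens is among the nearest neighbors of any query token, so it is never scored with MaxSim.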

Training and objective

ColBERTv1 is trained pairwise: given a triple of a query, a relevant positive passage, and a sampled negative passage, the model is optimized so that the MaxSim score of the positive exceeds that of the negative under a pairwise softmax cross-entropy loss.^[1] Negatives in v1 are sampled from the official MS MARCO triples, which are themselves derived from BM25 candidates.^[1] Because the two encoders share weights, training only updates a single BERT backbone plus a small linear projection layer. The choice of an L2-normalized 128-dimensional projection means that token similarities are effectively cosine values in [-1, 1], and that the MaxSim sum has a bounded range proportional to the number of query tokens.^[7]
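
As a rough sketch of that objective, the pairwise softmax cross-entropy over one (positive, negative) pair of MaxSim scores can be written as follows. The function name is hypothetical and the numerically stable softplus form is a choice made for this example, not something taken from the paper.

```python
import math


def pairwise_softmax_loss(pos_score, neg_score):
    """Cross-entropy over the two-way softmax of (positive, negative) scores."""
    # -log(exp(pos) / (exp(pos) + exp(neg))) equals softplus(neg - pos),
    # computed here in a form that avoids overflow for large differences.
    diff = neg_score - pos_score
    return max(diff, 0.0) + math.log1p(math.exp(-abs(diff)))


print(pairwise_softmax_loss(2.0, 2.0))    # tied scores: log(2)
print(pairwise_softmax_loss(30.0, 10.0))  # positive well ahead: near zero
print(pairwise_softmax_loss(10.0, 30.0))  # negative ahead: large penalty
```

Because each token similarity lies in [-1, 1], the two scores fed into this loss are bounded by the number of query tokens in absolute value.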

Computational profile

The original paper reported that ColBERT requires roughly four orders of magnitude fewer FLOPs per query than a BERT-base cross-encoder at re-ranking time, and that re-ranking could be served in tens of milliseconds per query on a single GPU once the document representations were pre-computed; end-to-end retrieval over the 8.8M-passage MS MARCO corpus was slower, at a few hundred milliseconds per query.^[1] The 128-dimensional projection is essential: it cuts the per-token memory by a factor of 6 compared with using BERT's native 768-dimensional output, with negligible quality cost.^[1]
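
The factor of 6 is simple arithmetic on the vector widths. The short calculation below assumes 2 bytes per value (float16) purely for illustration; the ratio is the same for any fixed precision.

```python
BYTES_PER_FLOAT16 = 2  # assumed storage precision for this example


def bytes_per_token(dim, bytes_per_value=BYTES_PER_FLOAT16):
    """Storage for one uncompressed token vector."""
    return dim * bytes_per_value


native = bytes_per_token(768)     # BERT-base hidden size
projected = bytes_per_token(128)  # ColBERT projection
print(native, projected, native / projected)
```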

What is ColBERTv2?

ColBERTv2 was uploaded to arXiv as 2112.01488 on December 2, 2021, by Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia.^[2] It was published at NAACL 2022 (Seattle).^[15] The paper attacks the two practical problems of v1: index size, and noisy training supervision. Its abstract states that ColBERTv2 "couples an aggressive residual compression mechanism with a denoised supervision strategy to simultaneously improve the quality and space footprint of late interaction," establishing "state-of-the-art quality within and outside the training domain while reducing the space footprint of late interaction models by 6-10x."^[2]

Residual compression

Rather than storing every 128-dimensional token vector in float16 or float32, ColBERTv2 runs k-means over a sample of the corpus's token vectors to obtain a set of centroids, then represents each token as the index of its nearest centroid plus a residual vector quantized to one or two bits per dimension.^[14]^[16] The number of centroids scales roughly as the square root of the number of token embeddings (rounded to the nearest power of two).^[14] This residual compression shrinks the index by 6 to 10 times relative to vanilla ColBERT while preserving retrieval quality, dropping the MS MARCO index from roughly 154 GiB to 16-25 GiB.^[14]^[16]
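
The idea of "centroid index plus quantized residual" can be illustrated with a deliberately simplified sketch. It uses a single sign bit per dimension and a fixed reconstruction step, which is an assumption made for this example and not ColBERTv2's actual quantizer; the function names and the 4-byte centroid id are likewise illustrative.

```python
def nearest_centroid(vec, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    def sq_dist(c):
        return sum((x - y) ** 2 for x, y in zip(vec, c))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def compress(vec, centroids):
    """Encode a vector as (centroid id, one sign bit per residual dimension)."""
    cid = nearest_centroid(vec, centroids)
    bits = [1 if x - c >= 0 else 0 for x, c in zip(vec, centroids[cid])]
    return cid, bits


def decompress(cid, bits, centroids, step):
    """Approximate the vector as centroid plus or minus a fixed step per dimension."""
    return [c + (step if b else -step) for c, b in zip(centroids[cid], bits)]


def bytes_per_vector(dim=128, bits_per_dim=1, centroid_id_bytes=4):
    """Storage per token: centroid id plus the packed residual bits."""
    return centroid_id_bytes + dim * bits_per_dim // 8


centroids = [[0.0, 0.0], [1.0, 1.0]]
code = compress([0.9, 1.2], centroids)
approx = decompress(*code, centroids, step=0.1)
print(code, approx)
print(bytes_per_vector(bits_per_dim=1), bytes_per_vector(bits_per_dim=2))
```

Under these assumptions a token costs 20 or 36 bytes instead of the 256 bytes of an uncompressed 128-dimensional float16 vector, which is in line with the 6 to 10 times reduction quoted above.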

Denoised supervision

ColBERTv1 was trained with simple BM25-sampled negatives, which can include relevant passages that were never labeled (false negatives) and so introduce label noise.^[14] ColBERTv2 instead distills knowledge from a strong cross-encoder. The team retrieves the top-k candidates per query using a current ColBERT checkpoint, re-scores them with a 22M-parameter MiniLM cross-encoder, and trains the student to imitate the teacher's score distribution via a KL-divergence loss over a 64-way tuple of one positive and many hard negatives.^[14]^[16] Combined with the compression scheme, this distillation yields state-of-the-art quality on 22 of 28 benchmarks reported in the paper across MS MARCO, BEIR, Wikipedia Open-QA, and the new LoTTE evaluation introduced alongside the model.^[2]^[16]
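
A minimal sketch of a KL-divergence distillation loss over one query's candidate list is shown below. It assumes both teacher and student scores are turned into distributions with a plain softmax and that the teacher distribution is the target; the function names and the three-passage example are hypothetical (the paper uses 64-way tuples).

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def kl_distillation_loss(teacher_scores, student_scores):
    """KL(teacher || student) over one query's candidate passages."""
    t_log = log_softmax(teacher_scores)
    s_log = log_softmax(student_scores)
    return sum(math.exp(t) * (t - s) for t, s in zip(t_log, s_log))


teacher = [4.0, 1.0, 0.5]   # cross-encoder scores (hypothetical)
student = [2.0, 1.8, 0.1]   # MaxSim scores (hypothetical)
print(kl_distillation_loss(teacher, student))
```

The loss is zero when the student's ranking distribution matches the teacher's, and it is unchanged if either score list is shifted by a constant, since only relative scores matter after the softmax.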

Centroid-based indexing

The same centroids that drive residual compression are reused as an inverted-list structure: each centroid points to the list of token positions assigned to it.^[14] At query time, the system uses the centroid table to find which passages contain tokens close to each query token. ColBERTv2 reports MRR@10 of 39.7% on the MS MARCO Dev set, ahead of contemporary baselines reported in the same paper, such as the learned-sparse model SPLADEv2 and the single-vector dense model RocketQAv2.^[16]^[22]
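
The inverted-list lookup can be sketched as follows. For brevity this example maps each centroid directly to passage ids rather than to token positions, which is a simplification; the function names and the toy assignments are assumptions for illustration.

```python
from collections import defaultdict


def build_inverted_lists(token_assignments):
    """Map each centroid id to the passages containing a token assigned to it.

    token_assignments: iterable of (passage_id, centroid_id), one per stored token.
    """
    lists = defaultdict(set)
    for passage_id, centroid_id in token_assignments:
        lists[centroid_id].add(passage_id)
    return lists


def candidate_passages(query_centroid_ids, inverted_lists):
    """Union of passages reachable from the centroids nearest each query token."""
    candidates = set()
    for cid in query_centroid_ids:
        candidates |= inverted_lists.get(cid, set())
    return candidates


assignments = [("p1", 0), ("p1", 3), ("p2", 3), ("p3", 7)]
lists = build_inverted_lists(assignments)
print(sorted(candidate_passages([3], lists)))
print(sorted(candidate_passages([0, 7], lists)))
```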

LoTTE: the long-tail evaluation benchmark

Alongside ColBERTv2, the team released LoTTE (Long-Tail Topic-stratified Evaluation for IR), a new benchmark composed of 12 domain-specific test sets built on StackExchange communities, pairing "search" queries derived from GooAQ with "forum" queries taken from StackExchange post titles.^[16] LoTTE was designed to evaluate out-of-domain generalization on natural search-style queries about long-tail topics, complementing the MS MARCO setting and the BEIR suite. The paper reports that ColBERTv2 achieves up to 8% relative improvement over the strongest baselines on aggregate out-of-domain (BEIR plus LoTTE) tasks.^[16] LoTTE has since become a standard benchmark for late-interaction and dense retrieval work.

What is PLAID?

PLAID ("Performance-optimized Late Interaction Driver") was introduced by Keshav Santhanam, Omar Khattab, Christopher Potts, and Matei Zaharia in arXiv 2205.09707 (uploaded May 19, 2022) and presented at CIKM 2022.^[3] It is an engineering-focused redesign of the ColBERTv2 search path that keeps the same model and index format but accelerates query processing.

The central observation is that the residual-compressed index from ColBERTv2 already has a natural pruning signal: every passage can be viewed as a "bag of centroids," and similarity between a query and a centroid bag is cheap to compute without ever materializing the full token vectors.^[3] PLAID uses this centroid interaction mechanism, which the paper describes as treating "every passage as a lightweight bag of centroids," to discard low-scoring candidates before performing any decompression, then progressively refines the surviving candidate set with more accurate scoring.^[3] Only at the final stage does it actually decompress residuals and compute the exact MaxSim scores for the top passages.^[3]

The reported result is that PLAID reduces end-to-end latency by up to 7x on GPU and 45x on CPU compared with vanilla ColBERTv2, while preserving its retrieval quality.^[3] In the paper's words, this brings ColBERTv2 to "tens of milliseconds on a GPU and tens or just few hundreds of milliseconds on a CPU at large scale," including on the largest collection evaluated, at 140 million passages.^[3] PLAID is now the default search engine in the stanford-futuredata/ColBERT main branch.^[4]

Architectural details

PLAID operates in roughly four phases per query.^[3] First, the query is encoded into its multi-vector representation. Second, the engine looks up each query token's nearest centroids (configurable via the ncells parameter) to identify candidate passages whose centroid bags overlap with the query's. Third, an approximate score is computed using only centroid-level similarities, and only the top candidates above a configurable threshold proceed. Fourth, residuals are decompressed and exact MaxSim scoring is performed on the surviving top ndocs passages.^[3] This staged pruning yields a cascade where each successive stage is more expensive but operates on far fewer candidates, which is the source of PLAID's speedups.^[3] Reproducibility studies have shown that retrieval quality is sensitive to these hyperparameters, and that the published defaults are tuned for MS MARCO-like distributions.^[21]
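
The cascade can be illustrated with a reduced two-stage sketch: a cheap score computed from centroid ids only, followed by exact MaxSim on the survivors. This is not PLAID's implementation. The threshold-based pruning step is omitted, `ndocs` is borrowed from the text while `top_k` and the function names are hypothetical, and the toy vectors stand in for decompressed embeddings.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs, doc_vecs):
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def centroid_interaction(query_vecs, centroid_bag, centroids):
    """Approximate MaxSim using only the centroids a passage's tokens map to."""
    return maxsim(query_vecs, [centroids[c] for c in centroid_bag])


def staged_search(query_vecs, bags, full_vectors, centroids, ndocs=2, top_k=1):
    # Cheap stage: score every candidate from its bag of centroid ids.
    approx = sorted(
        bags,
        key=lambda p: centroid_interaction(query_vecs, bags[p], centroids),
        reverse=True,
    )
    survivors = approx[:ndocs]
    # Expensive stage: exact MaxSim on the survivors' full token vectors.
    exact = sorted(survivors,
                   key=lambda p: maxsim(query_vecs, full_vectors[p]),
                   reverse=True)
    return exact[:top_k]


centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
bags = {"p1": [0, 1], "p2": [0], "p3": [2]}
full_vectors = {
    "p1": [[0.8, 0.6], [0.6, 0.8]],
    "p2": [[1.0, 0.0]],
    "p3": [[-1.0, 0.0]],
}
query = [[1.0, 0.0], [0.0, 1.0]]
print(staged_search(query, bags, full_vectors, centroids))
```

Passage "p3" is discarded by the centroid-only stage, so its full vectors are never touched; this is the kind of saving the staged design is meant to deliver.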

How does ColBERT differ from dense retrievers and cross-encoders?

Paradigm	Representation	Query-time cost	Quality	Storage
Lexical (BM25)	Sparse term weights	Very low	Baseline	Small
Dense bi-encoder (e.g. DPR, Sentence-BERT)	One vector per passage	Very low (ANN over single vectors)	Good but information loss from pooling^[7]	Small
Late interaction (ColBERT)	One vector per token	Medium (MaxSim over candidates)	Strong, especially out-of-domain^[16]	Large (compressed in v2)
Cross-encoder reranker	Joint Q-D pass	Very high (linear in candidates)	Highest per pair^[1]	None (no index)

The trade-off is essentially that ColBERT spends extra memory to store per-token vectors, then recovers cross-encoder-like accuracy through MaxSim while still allowing index-based retrieval.^[1]^[7] Empirically, ColBERTv2 is competitive with or stronger than single-vector dense models on standard retrieval benchmarks, with particular gains on out-of-domain queries.^[16] The key practical distinction is where the query and document "meet": a bi-encoder never compares their tokens directly, a cross-encoder compares them with full attention at query time, and ColBERT compares them token-by-token with MaxSim after both sides are encoded independently, getting most of the cross-encoder's precision at a fraction of its cost.^[1]^[7]

What is ColBERT used for?

ColBERT's primary application is first-stage and second-stage passage retrieval in search and question answering pipelines.^[1]^[16] Because MaxSim scoring is more precise than single-vector cosine but still cheap enough to apply across thousands of candidates, ColBERT-family models are commonly used as:

Stronger first-stage retrievers when BM25 or a dense bi-encoder cannot capture nuanced multi-aspect queries.^[7]
Lightweight rerankers in place of full cross-encoders, when latency budgets do not allow a BERT-large cross-encoder pass over hundreds of candidates.^[1]
Components inside retrieval-augmented generation (RAG) systems, where libraries like RAGatouille drop ColBERTv2 indexes into LangChain and LlamaIndex pipelines.^[5]

The architecture also serves as a building block for the broader research direction of "compositional" retrieval systems exemplified by DSPy, the declarative LLM-programming framework co-created by Khattab, which uses ColBERT as one of its first-class retriever modules.^[9]

ColPali and its successors generalize the approach beyond text to documents represented as page images, enabling retrieval over PDFs, slides, and financial reports without text extraction.^[11] Long-context variants such as Vespa's LongColBERT extend the same machinery to multi-page documents by aggregating MaxSim scores over sliding windows, addressing the historical limit on how many tokens a single ColBERT pass could process.^[19] In open-domain QA, ColBERT-style retrievers have been used inside multi-hop and agentic pipelines that fetch evidence from large Wikipedia or web-scale corpora before passing it to an LLM.^[9]

Is ColBERT open source, and who maintains it?

Stanford-FutureData reference implementation

The canonical implementation lives at github.com/stanford-futuredata/ColBERT, distributed under the MIT license.^[4] The repository covers ColBERTv1 (on a legacy branch) and ColBERTv2 with PLAID as the main path, and includes training scripts, indexer, retriever, a JSON API server, and a published ColBERTv2 checkpoint trained on MS MARCO.^[4] The repository has more than 3,900 stars and lists papers from SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, and EMNLP'23 as part of the project's research output.^[4]

The same FutureData group at Stanford University, originally led by Matei Zaharia, has produced a broader stack of retrieval and LLM-systems software, including DSPy and related projects that consume ColBERT as a retriever module.^[4]^[9] The reference checkpoint, colbert-ir/colbertv2.0, is hosted on Hugging Face and serves as the default model for most downstream libraries.^[18]

RAGatouille

RAGatouille is a Python library created by Benjamin Clavié that wraps ColBERT for use in retrieval-augmented generation pipelines.^[5] Clavié describes the project's goal as closing "the growing gap between the Information Retrieval literature and everyday production retrieval uses," exposing ColBERT through a "3-lines API" while still allowing access to underlying parameters.^[5] The library covers indexing, fine-tuning (including automatic hard-negative mining), and inference on top of the stanford-futuredata implementation, and integrates with LangChain and LlamaIndex.^[5] RAGatouille is now maintained under the Answer.AI organization and is released under Apache 2.0.^[5] Benjamin Clavié credits RAGatouille with helping push ColBERTv2 Hugging Face downloads from about 50k to roughly 3M per month.^[17] By late 2024, ColBERTv2 was averaging around 5 million monthly downloads on Hugging Face.^[18]

Vespa

Vespa shipped a native ColBERT embedder in its serving engine on February 14, 2024.^[6] It exposes MaxSim as a Vespa ranking expression (max_sim) applied to the top candidates from a first phase, and adds an asymmetric binary compression that stores each 128-dimensional float document vector as 16 bytes of packed bits while keeping query vectors at full precision, claimed at 32x compression of ColBERT token embeddings.^[6] In a follow-up, Vespa announced a long-context ColBERT extension on March 1, 2024 that breaks long documents into sliding windows and aggregates per-context MaxSim scores, with reported nDCG@10 gains over single-vector baselines such as E5-Mistral and OpenAI's text-embedding-ada-002 on the MLDR long-document benchmark.^[19] Other vector engines, including Qdrant and Weaviate, have published multi-vector storage formats compatible with ColBERT-style indexes.^[11]
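
The storage arithmetic behind the 32x figure (128 floats at 4 bytes each versus 16 bytes of bits) can be illustrated with a small sketch. This is not Vespa's implementation: thresholding at zero, the most-significant-bit-first packing order, and the function names are assumptions made for the example.

```python
def pack_bits(vec):
    """Pack one sign bit per dimension (1 if value > 0) into bytes, MSB first."""
    if len(vec) % 8:
        raise ValueError("dimension must be a multiple of 8")
    out = bytearray()
    for start in range(0, len(vec), 8):
        byte = 0
        for value in vec[start:start + 8]:
            byte = (byte << 1) | (1 if value > 0 else 0)
        out.append(byte)
    return bytes(out)


def unpack_bits(packed):
    """Expand packed bytes back into a list of 0.0 / 1.0 values."""
    return [float((byte >> shift) & 1)
            for byte in packed for shift in range(7, -1, -1)]


def asymmetric_maxsim(query_vecs, packed_docs):
    """MaxSim between full-precision query vectors and bit-packed document vectors."""
    docs = [unpack_bits(p) for p in packed_docs]
    return sum(max(sum(q * d for q, d in zip(qv, dv)) for dv in docs)
               for qv in query_vecs)


doc_vec = [0.5, -0.1, 0.2, -0.3, 0.0, 0.9, -0.7, 0.4]
packed = pack_bits(doc_vec)
print(packed.hex(), unpack_bits(packed))
print(len(pack_bits([0.1] * 128)), (128 * 4) // 16)
```

The scoring is asymmetric because only the document side is binarized: the query vectors keep their float values and are multiplied against the unpacked 0/1 document bits.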

JaColBERT

JaColBERT is a family of Japanese-language ColBERT models released by Benjamin Clavié starting in late 2023.^[10] The first technical report (arXiv 2312.16144, "Towards Better Monolingual Japanese Retrievers with Multi-Vector Models," December 26, 2023) trains a 110M-parameter Japanese ColBERT on roughly two orders of magnitude less data than multilingual competitors, and reports outperforming all prior monolingual Japanese retrievers and surpassing multilingual models on out-of-domain tasks.^[10] JaColBERTv2.5 (arXiv 2407.20750, July 2024) refines the recipe with knowledge distillation and weight averaging, training on 3.2M triplets for 15 hours on four A100 GPUs and setting a new state of the art on Japanese retrieval benchmarks with just 110M parameters.^[12]

Answer.AI subsequently released answerai-colbert-small-v1, a 33-million-parameter English ColBERT trained with the JaColBERTv2.5 recipe and a weights-averaged base initialized from MiniLM, gte-small, and bge-small-en-v1.5 checkpoints. The model was released in August 2024 and is reported to outperform the original 110M-parameter ColBERTv2 on common benchmarks including LoTTE despite being less than a third of the size.^[13]

ColPali and multimodal extensions

ColPali (arXiv 2407.01449, June 27, 2024) by Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, and Pierre Colombo extends late interaction to document images.^[11] It uses a PaliGemma-3B vision-language model to encode page images into multi-vector representations, and scores them against text queries with a ColBERT-style MaxSim. The paper introduces the ViDoRe benchmark for visual document retrieval and reports that ColPali outperforms standard text-extraction pipelines while being simpler and end-to-end trainable.^[11] ColPali was accepted to ICLR 2025.^[11] Follow-up models in the same family include ColQwen2 and ColSmol from the illuin-tech repository.^[11]

The appeal of ColPali for document understanding is that it removes the brittle text-extraction step: PDFs, scanned forms, slides, and tables can be indexed directly as page images, with visual elements such as fonts, figures, and layout signaling relevance.^[11] Weaviate and Qdrant have published guides showing how to store ColPali multi-vector embeddings in their engines, and Vespa has separately demonstrated scaling ColPali to billion-page corpora.^[7]

What are ColBERT's limitations?

The principal cost of late interaction is index size. Even after ColBERTv2's compression, an MS MARCO index of roughly 9 million passages is on the order of 20 GiB, which is substantially larger than a single-vector index of the same corpus.^[14]^[16] For web-scale collections this overhead can dominate hardware costs, which is the explicit motivation for follow-up work on token pooling, projection variants, and multi-vector quantization.^[20]

Latency is also non-trivial. Vanilla ColBERTv2 can take hundreds of milliseconds per query at large scale, which is why PLAID and PLAID-style engines are needed in practice.^[3] Even with PLAID, end-to-end ColBERT search remains slower than a tuned dense bi-encoder serving from an HNSW or IVF index, and the engineering complexity is higher.^[3]^[20]

A separate critique is that the token-level interaction signal, while useful, is not free of "noise": MaxSim can be dominated by superficially matching tokens, and the lack of a single global representation makes some downstream tasks (e.g. straightforward semantic similarity between two passages) less natural to express than with single-vector embeddings.^[20] Subsequent work, including PyLate and ConstBERT, has explored variants of late interaction that pool or constrain the token vectors to address these concerns.^[21]

Reproducibility studies have also surfaced some sensitivity. A reproducibility paper on PLAID (2024) showed that the engine's quality-latency frontier depends sensitively on hyperparameters such as ncells, ndocs, and threshold, and that off-the-shelf defaults are not always optimal across query distributions.^[21]

Finally, although ColBERTv2 is BERT-base sized at roughly 110M parameters, training a state-of-the-art ColBERT-style model still relies on a strong cross-encoder teacher (typically a distilled MiniLM or larger), an inverted-list construction over millions of token vectors, and careful hard-negative mining. This makes from-scratch training more involved than fine-tuning a single-vector encoder, and is part of the motivation for RAGatouille's higher-level training API and the small-but-mighty models such as answerai-colbert-small-v1 that aim to bring ColBERT-quality retrieval down to commodity-CPU latencies.^[5]^[13]

Related work

ColBERT sits within a broader family of retrieval models: BM25 as the lexical baseline, DPR and Sentence-BERT for single-vector dense retrieval, and SPLADE for sparse neural retrieval (which learns to expand text into a sparse weighted bag of vocabulary terms).^[7]^[8] ColBERT is often combined with these complementary representations in RAG pipelines via re-ranking or fusion.^[5] On the systems side, ColBERT indexes are increasingly stored in dedicated vector databases like Pinecone, Qdrant, and Weaviate, or alongside other ranking signals in engines such as Vespa.^[6]^[11]

References

1. Omar Khattab and Matei Zaharia, "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT", arXiv preprint, 2020-04-27. https://arxiv.org/abs/2004.12832. Accessed 2026-05-21.
2. Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia, "ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction", arXiv preprint, 2021-12-02. https://arxiv.org/abs/2112.01488. Accessed 2026-05-21.
3. Keshav Santhanam, Omar Khattab, Christopher Potts, and Matei Zaharia, "PLAID: An Efficient Engine for Late Interaction Retrieval", arXiv preprint, 2022-05-19. https://arxiv.org/abs/2205.09707. Accessed 2026-05-21.
4. Stanford FutureData, "ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)", GitHub repository, 2024. https://github.com/stanford-futuredata/ColBERT. Accessed 2026-05-21.
5. Benjamin Clavié / Answer.AI, "RAGatouille: Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline", GitHub repository, 2024. https://github.com/AnswerDotAI/RAGatouille. Accessed 2026-05-21.
6. Jo Kristian Bergum, "Announcing the Vespa ColBERT embedder", Vespa Blog, 2024-02-14. https://blog.vespa.ai/announcing-colbert-embedder-in-vespa/. Accessed 2026-05-21.
7. Weaviate, "An Overview of Late Interaction Retrieval Models: ColBERT, ColPali, and ColQwen", Weaviate Blog, 2024. https://weaviate.io/blog/late-interaction-overview. Accessed 2026-05-21.
8. Continuum Labs, "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT (summary)", training.continuumlabs.ai, 2024. https://training.continuumlabs.ai/knowledge/vector-databases/colbert-efficient-and-effective-passage-search-via-contextualized-late-interaction-over-bert. Accessed 2026-05-21.
9. Omar Khattab, "Personal homepage", omarkhattab.com, 2025. https://omarkhattab.com/. Accessed 2026-05-21.
10. Benjamin Clavié, "JaColBERT and Hard Negatives, Towards Better Japanese-First Embeddings for Retrieval: Early Technical Report" (later titled "Towards Better Monolingual Japanese Retrievers with Multi-Vector Models"), arXiv preprint 2312.16144, 2023-12-26. https://arxiv.org/abs/2312.16144. Accessed 2026-05-21.
11. Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, and Pierre Colombo, "ColPali: Efficient Document Retrieval with Vision Language Models", arXiv preprint 2407.01449, 2024-06-27. https://arxiv.org/abs/2407.01449. Accessed 2026-05-21.
12. Benjamin Clavié, "JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources", arXiv preprint 2407.20750, 2024-07-30. https://arxiv.org/abs/2407.20750. Accessed 2026-05-21.
13. Benjamin Clavié, "Small but Mighty: Introducing answerai-colbert-small", Answer.AI Blog, 2024-08-13. https://www.answer.ai/posts/2024-08-13-small-but-mighty-colbert.html. Accessed 2026-05-21.
14. Keshav Santhanam et al., "ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction" (HTML version, technical details), ar5iv.labs.arxiv.org, 2022. https://ar5iv.labs.arxiv.org/html/2112.01488. Accessed 2026-05-21.
15. ACL Anthology, "ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction (NAACL 2022, pages 3715-3734)", aclanthology.org, 2022. https://aclanthology.org/2022.naacl-main.272/. Accessed 2026-05-21.
16. Emergent Mind, "ColBERTv2: Lightweight Late Interaction Retrieval (summary)", emergentmind.com, 2024. https://www.emergentmind.com/papers/2112.01488. Accessed 2026-05-21.
17. Benjamin Clavié, "Personal homepage and About page", ben.clavie.eu, 2025. https://ben.clavie.eu/about/. Accessed 2026-05-21.
18. Build Fast with AI, "RAGatouille: Smarter AI Retrieval Made Simple", buildfastwithai.com, 2024. https://www.buildfastwithai.com/blogs/what-is-ragatouille. Accessed 2026-05-21.
19. Jo Kristian Bergum, "Announcing Vespa Long-Context ColBERT", Vespa Blog, 2024-03-01. https://blog.vespa.ai/announcing-long-context-colbert-in-vespa/. Accessed 2026-05-21.
20. Emergent Mind, "ColBERT-Style Late Interaction (limitations and trade-offs)", emergentmind.com, 2024. https://www.emergentmind.com/topics/colbert-style-late-interaction. Accessed 2026-05-21.
21. Sean MacAvaney et al., "A Reproducibility Study of PLAID", arXiv preprint 2404.14989, 2024-04. https://arxiv.org/html/2404.14989v1. Accessed 2026-05-21.
22. Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia, "ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction" (NAACL 2022 camera-ready PDF, results tables), people.eecs.berkeley.edu, 2022. https://people.eecs.berkeley.edu/~matei/papers/2022/naacl_colbert_v2.pdf. Accessed 2026-06-23.
