# HyDE (Hypothetical Document Embeddings)

> Source: https://aiwiki.ai/wiki/hyde
> Updated: 2026-06-24
> Categories: Information Retrieval, Natural Language Processing
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**HyDE** (Hypothetical Document Embeddings) is a zero-shot [dense retrieval](/wiki/dense_retrieval) technique that, instead of searching with the user's query, first prompts an instruction-following [large language model](/wiki/large_language_model) to write a hypothetical answer document for the query, then embeds that synthetic document and uses its vector to retrieve real documents. It was introduced in December 2022 by Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan in the paper *Precise Zero-Shot Dense Retrieval without Relevance Labels*.[^1] The synthetic document is embedded with an unsupervised contrastive encoder such as Contriever, and that [embedding](/wiki/embeddings) vector drives nearest-neighbor search in the corpus.[^1] The method requires no relevance labels and no fine-tuned dual encoder, yet on TREC Deep Learning, BEIR, and Mr.TyDi benchmarks it reported nDCG@10 of 61.3 on TREC DL19 (versus 44.5 for the unsupervised baseline Contriever and 50.6 for lexical [BM25](/wiki/bm25)) and approached supervised, in-domain fine-tuned retrievers.[^1][^2] As the abstract summarises, "HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever and shows strong performance comparable to fine-tuned retrievers across various tasks (e.g. web search, QA, fact verification) and languages."[^1] HyDE became one of the most widely adopted query-side enhancements for [retrieval-augmented generation](/wiki/retrieval_augmented_generation) (RAG) systems, with first-party integrations in [LangChain](/wiki/langchain) and [LlamaIndex](/wiki/llamaindex).[^3][^4]

## Background

### What problem does HyDE solve?

Dense retrievers based on dual encoders, such as Dense Passage Retrieval ([DPR](/wiki/dense_passage_retrieval)), encode queries and passages into a shared vector space and retrieve passages by cosine or inner-product similarity.[^1] These models typically need substantial training on labelled query-document pairs (for example MS MARCO) to learn that space well, because the two encoders must agree on which surface forms of a question map to which surface forms of an answering passage.[^1] When the same encoder is moved to a new domain without further supervision, retrieval quality drops sharply, a weakness documented in the BEIR benchmark of Thakur et al., which collected nine heterogeneous retrieval tasks spanning fact verification, scientific question answering, financial QA, and argument retrieval.[^1] On many BEIR tasks, supervised dense retrievers actually underperformed lexical [BM25](/wiki/bm25), a result that motivated a wave of follow-up work on better zero-shot encoders.[^1]

### Earlier approaches

The two main classes of pre-HyDE responses were better self-supervised encoders and synthetic-data approaches. Contriever, introduced by Izacard et al. at Meta in 2022, trained an encoder purely with contrastive objectives over unlabelled corpora using random cropping to build positive pairs.[^1][^14] It improved zero-shot performance but still lagged fine-tuned retrievers on out-of-domain tasks. In parallel, methods such as InPars from Bonifacio et al. and Promptagator from Dai et al. used [large language models](/wiki/large_language_model) to *generate synthetic queries* from real corpus documents, then fine-tuned a small retriever on those synthetic pairs; this trades inference-time LLM calls for offline training cost but requires per-corpus fine-tuning.[^1]

### Conceptual move

The HyDE authors framed zero-shot retrieval as fundamentally hard because nothing in the encoder's training tells it which corpus passage answers a given query intent: queries and answers occupy different distributions, and an unsupervised encoder cannot bridge them.[^1] Their proposal inverted the InPars/Promptagator direction. Rather than generating queries from documents and re-training the retriever, HyDE generates *documents from queries at inference time* and reuses an off-the-shelf unsupervised encoder unchanged.[^1] The hard work of mapping a question to an answer text is delegated to a generic instruction-following [language model](/wiki/language_model) such as InstructGPT, which has been trained to follow natural language instructions on broad web data, while the dense encoder is given the easier job of finding real passages near a generated text in [embedding space](/wiki/embedding_space).[^1]

### When was HyDE published?

The paper was first posted to arXiv on 20 December 2022 with arXiv identifier 2212.10496[^1] and later published in the *Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics* (ACL 2023, Volume 1: Long Papers) in Toronto, Canada, pages 1762 to 1777.[^2] Lead author Luyu Gao was a PhD student at Carnegie Mellon University's Language Technologies Institute under Jamie Callan and previously studied computer science at the University of Illinois at Urbana-Champaign; he has worked on retrieval, [transformer](/wiki/transformer) pre-training, and program-aided reasoning, and was also lead author of the PAL paper on program-aided language models.[^5] Co-authors Xueguang Ma and Jimmy Lin were at the University of Waterloo's David R. Cheriton School of Computer Science; Lin is the principal investigator behind the Pyserini toolkit that HyDE's reference implementation uses for evaluation.[^5][^7] Jamie Callan is a long-time information retrieval professor at [Carnegie Mellon University](/wiki/cmu)'s Language Technologies Institute.[^5]

## How does HyDE work?

HyDE replaces the single encoding step of a dense retriever with a two-stage pipeline. The query path becomes "query, then LLM, then encoder, then index"; the corpus path is unchanged from a normal dense retriever.[^1] In the authors' own words, "HyDE first zero-shot instructs an instruction-following language model (e.g. InstructGPT) to generate a hypothetical document" that "captures relevance patterns but is unreal and may contain false details."[^1]

### Pipeline

1. **Instruction prompt**: a task-specific natural language instruction is concatenated with the user query and sent to an instruction-following LLM. The reference implementation uses InstructGPT (`text-davinci-003`) with a prompt such as `"Please write a passage to answer the question.\n\nQuestion: {query}\n\nPassage:"`.[^6]
2. **Hypothetical document generation**: the LLM samples one or more candidate "documents" of a few sentences each. These passages are not retrieved from the corpus and may contain factual errors or invented details; the authors call them deliberately *hypothetical*.[^1] In the reference setup the model is asked to write a passage of typical length and genre for the target domain (a financial article, a scientific abstract, a news story).[^6]
3. **Document embedding**: each generated document, plus optionally the original query, is encoded by an unsupervised contrastive encoder. The reference setup uses Contriever for English and mContriever for multilingual experiments.[^1] Because the corpus index was built with the same encoder, the generated text lands in the same vector space as real passages.[^1]
4. **Vector averaging**: when multiple hypothetical documents are sampled (the paper uses N = 8), their embeddings are averaged with the query embedding to form a single search vector.[^1] The query embedding is included as a regularizer that anchors the search to the surface form of the question and dampens the influence of any single bad generation.[^1]
5. **Nearest-neighbor search**: the averaged search vector is used to retrieve real passages from a corpus-side dense index built with the same encoder, typically through [FAISS](/wiki/faiss) or Pyserini.[^7] The retrieved passages, not the hypothetical document, are the final output of the system.[^1]
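
The five steps above can be sketched end to end in a few lines. This is a minimal illustration, not the reference implementation: `toy_encode` (a normalised bag-of-words stored as a dict) stands in for Contriever, `toy_generate` (canned text) stands in for the instruction-following LLM, and the three-passage corpus is invented for the example.

```python
import math
from collections import Counter

Vector = dict[str, float]  # toy "embedding": token -> weight


def toy_encode(text: str) -> Vector:
    """Hypothetical stand-in for a dense encoder: L2-normalised bag of words."""
    counts = Counter(t.strip(".,?!:;\"'()").lower() for t in text.split())
    counts.pop("", None)
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def toy_generate(prompt: str, n: int) -> list[str]:
    """Hypothetical stand-in for the LLM: returns the same canned passage n times."""
    canned = "The mitochondria is the organelle that produces energy (ATP) for the cell."
    return [canned] * n


def mean(vectors: list[Vector]) -> Vector:
    total: Counter = Counter()
    for vec in vectors:
        for tok, weight in vec.items():
            total[tok] += weight
    return {tok: weight / len(vectors) for tok, weight in total.items()}


def dot(a: Vector, b: Vector) -> float:
    return sum(weight * b.get(tok, 0.0) for tok, weight in a.items())


def hyde_vector(query: str, n: int = 8) -> Vector:
    # Steps 1-2: build the instruction prompt and sample n hypothetical documents.
    prompt = f"Please write a passage to answer the question.\n\nQuestion: {query}\n\nPassage:"
    docs = toy_generate(prompt, n)
    # Steps 3-4: embed each document plus the query, then average into one search vector.
    return mean([toy_encode(d) for d in docs] + [toy_encode(query)])


def search(vector: Vector, corpus: list[str], k: int = 1) -> list[str]:
    # Step 5: brute-force inner-product search; only real corpus passages are returned.
    ranked = sorted(corpus, key=lambda p: dot(vector, toy_encode(p)), reverse=True)
    return ranked[:k]


corpus = [
    "Mitochondria produce ATP, the main energy currency of the cell.",
    "The stock market closed higher on Friday after a volatile week.",
    "Ribosomes assemble proteins from amino acids.",
]
top = search(hyde_vector("what makes energy in a cell?"), corpus)
print(top[0])
```

In a real deployment the two stand-ins would be replaced by an LLM client and the same encoder that built the corpus index; the averaging and search logic would stay the same.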

### Why it works

The hypothetical document expresses the relevance pattern the user implicitly cares about. Although the generation may hallucinate specifics, the authors expect the encoder's dense bottleneck to act as a lossy compressor that filters out incorrect details and projects the output near the true answer's neighborhood in [embedding space](/wiki/embedding_space).[^1] HyDE thus exploits the LLM's world knowledge for query understanding while delegating final grounding to the corpus index, which prevents the system from returning fabricated content directly to the user; only real corpus passages are ever retrieved.[^1]

The technique also recasts the query, which is often a short keyword string or a question fragment, into the genre, vocabulary, and length distribution of the target corpus. This brings the search vector closer to the manifold occupied by the actual passages and avoids the well-known query-document length and style mismatch that hurts naive [cosine similarity](/wiki/cosine_similarity) search with general-purpose encoders.[^1] The authors note that, in this sense, HyDE generalises the classical query expansion strategy of pseudo-relevance feedback: instead of expanding using terms drawn from a noisy first-pass retrieval, it expands using a passage drawn from the LLM's parametric memory.[^1]

### Hyperparameters

The reference implementation exposes a small number of knobs: the prompt template, the generator model, the sampling `temperature` (set to 0.7 in the published experiments), the number of samples `n` (set to 8), and the maximum number of tokens per sample (set to 512).[^9] Lowering `n` reduces cost and latency at some cost in retrieval quality; raising `temperature` increases diversity across samples but also the risk of off-topic generations.[^9]
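
To make the cost trade-off concrete, the knobs can be grouped in a small configuration object. The sketch below uses a hypothetical `HydeConfig` dataclass (it is not a class from the reference repository) with the published values as defaults, and computes the worst-case number of generated tokens per query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HydeConfig:
    """Hypothetical container for the knobs described above."""

    temperature: float = 0.7  # sampling temperature from the published experiments
    n: int = 8                # hypothetical documents sampled per query
    max_tokens: int = 512     # generation cap per sample

    def max_generated_tokens(self) -> int:
        # Upper bound on completion tokens produced for a single query.
        return self.n * self.max_tokens


paper = HydeConfig()
cheap = HydeConfig(n=1)
print(paper.max_generated_tokens(), cheap.max_generated_tokens())
```

With the published settings the upper bound is eight times that of a single-sample configuration, which is why lowering `n` is usually the first lever for reducing cost.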

### Prompt templates

The reference repository ships eight task-specific prompts, one per benchmark domain. Each prompt instructs the LLM to write a passage in the appropriate genre.[^6]

| Task key | Prompt skeleton |
|---|---|
| `web_search` | "Please write a passage to answer the question." |
| `scifact` | "Please write a scientific paper passage to support/refute the claim." |
| `arguana` | "Please write a counter argument for the passage." |
| `trec_covid` | "Please write a scientific paper passage to answer the question." |
| `fiqa` | "Please write a financial article passage to answer the question." |
| `dbpedia_entity` | "Please write a passage to answer the question." |
| `trec_news` | "Please write a news passage about the topic." |
| `mr_tydi` | "Please write a passage in {language} to answer the question in detail." |

The `mr_tydi` template additionally accepts a language name, which lets HyDE generate hypothetical passages in the target language before encoding them with a multilingual encoder.[^6]
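
A prompt builder over these skeletons might look like the sketch below. The skeleton strings are copied from the table; the `build_prompt` helper and the assumption that every task reuses the `Question: ... Passage:` layout quoted earlier for web search are illustrative only, and the repository's actual per-task layouts may differ.

```python
# Prompt skeletons copied from the table above.
PROMPT_SKELETONS = {
    "web_search": "Please write a passage to answer the question.",
    "scifact": "Please write a scientific paper passage to support/refute the claim.",
    "arguana": "Please write a counter argument for the passage.",
    "trec_covid": "Please write a scientific paper passage to answer the question.",
    "fiqa": "Please write a financial article passage to answer the question.",
    "dbpedia_entity": "Please write a passage to answer the question.",
    "trec_news": "Please write a news passage about the topic.",
    "mr_tydi": "Please write a passage in {language} to answer the question in detail.",
}


def build_prompt(task: str, query: str, language: str = "English") -> str:
    """Hypothetical helper: fill the language slot and append the query."""
    instruction = PROMPT_SKELETONS[task].format(language=language)
    # Assumed layout, reused for every task for simplicity.
    return f"{instruction}\n\nQuestion: {query}\n\nPassage:"


print(build_prompt("mr_tydi", "HyDE ni nini?", language="Swahili"))
```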

## How well does HyDE perform?

Gao and colleagues evaluated HyDE on three families of benchmarks: TREC Deep Learning web search (DL19, DL20), BEIR low-resource retrieval, and Mr.TyDi multilingual retrieval. All numbers below are taken from the paper's tables and reflect HyDE with InstructGPT plus Contriever (mContriever for Mr.TyDi).[^1]

### Web search (TREC DL)

On the TREC Deep Learning passage tracks built on MS MARCO, HyDE reached nDCG@10 of 61.3 on DL19 and 57.9 on DL20, compared with Contriever (44.5 / 42.1) and lexical [BM25](/wiki/bm25) (50.6 / 48.0). The fine-tuned supervised baseline ContrieverFT scored 62.1 / 63.2, so without any labels HyDE came within 0.8 points of it on DL19 and trailed by 5.3 points on DL20.[^1]
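
For readers unfamiliar with the metric, nDCG@10 can be computed as in the sketch below. It assumes linear gains and a log2 rank discount, which is one common formulation; the relevance grades in the example are invented, and published scores are this ratio multiplied by 100.

```python
import math


def dcg_at_k(gains: list[float], k: int) -> float:
    # Gain of the item at 0-based position i is discounted by log2(i + 2).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains: list[float], judged_gains: list[float], k: int = 10) -> float:
    """ranked_gains: grades of the returned passages in rank order.
    judged_gains: grades of all judged passages for the query (any order)."""
    ideal = dcg_at_k(sorted(judged_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


# A system that returns a grade-3 passage first, an irrelevant one second,
# and a grade-2 passage third, when grades 3, 2, and 1 were available.
score = ndcg_at_k([3, 0, 2], [3, 2, 1])
print(round(score, 3))
```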

### Low-resource BEIR tasks (nDCG@10)

| Dataset | HyDE | Contriever | BM25 |
|---|---|---|---|
| SciFact | 69.1 | 64.9 | 67.9 |
| ArguAna | 46.6 | 37.9 | 39.7 |
| TREC-COVID | 59.3 | 27.3 | 59.5 |
| FiQA | 27.3 | 24.5 | 23.6 |
| DBPedia | 36.8 | 29.2 | 31.8 |
| TREC-NEWS | 44.0 | 34.8 | 39.5 |

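Across BEIR datasets, HyDE improved on the underlying Contriever encoder in every domain, by margins ranging from 2.8 points on FiQA to 32.0 points on TREC-COVID. It beat BM25 on five of the six datasets and fell 0.2 points short on TREC-COVID.[^1]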
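
The margins behind this summary can be tabulated directly from the table. The short script below only restates the six rows above and subtracts the baselines from the HyDE column.

```python
# nDCG@10 values copied from the table above: (HyDE, Contriever, BM25).
BEIR_NDCG10 = {
    "SciFact": (69.1, 64.9, 67.9),
    "ArguAna": (46.6, 37.9, 39.7),
    "TREC-COVID": (59.3, 27.3, 59.5),
    "FiQA": (27.3, 24.5, 23.6),
    "DBPedia": (36.8, 29.2, 31.8),
    "TREC-NEWS": (44.0, 34.8, 39.5),
}

# Margin of HyDE over each baseline, rounded to the table's precision.
margins = {
    name: (round(hyde - contriever, 1), round(hyde - bm25, 1))
    for name, (hyde, contriever, bm25) in BEIR_NDCG10.items()
}
below_bm25 = [name for name, (_, vs_bm25) in margins.items() if vs_bm25 < 0]

for name, (vs_contriever, vs_bm25) in margins.items():
    print(f"{name:<11} vs Contriever {vs_contriever:+5.1f}   vs BM25 {vs_bm25:+5.1f}")
print("Below BM25:", below_bm25)
```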

### Multilingual retrieval (Mr.TyDi MRR@100)

| Language | HyDE | mContriever | mContrieverFT |
|---|---|---|---|
| Swahili | 41.7 | 38.3 | 51.2 |
| Korean | 30.6 | 22.3 | 34.2 |
| Japanese | 30.7 | 19.5 | 32.4 |
| Bengali | 41.3 | 35.3 | 42.3 |

On Mr.TyDi, HyDE roughly matched or approached the fully fine-tuned mContrieverFT on Korean, Japanese, and Bengali, and beat the zero-shot mContriever on every language tested, even though the prompts to InstructGPT were issued in English with a language-name slot.[^1]

### Effect of the generator model

The paper additionally reports an ablation over the generator. Replacing InstructGPT with smaller or weaker generators (an unaligned GPT-3 base, or smaller instruction-tuned variants) degraded performance in line with how well each model followed the instruction to write a passage of the requested form.[^1] This is consistent with the conceptual claim that HyDE's quality is bounded by the generator's ability to produce a passage that looks like the answer should look, regardless of factual accuracy in the surface form.[^1]

### Comparison with traditional dense retrievers and query expansion

The authors position HyDE against three families of baselines: lexical BM25, unsupervised dense retrievers (Contriever, mContriever), and supervised fine-tuned dense retrievers.[^1] HyDE outperforms the unsupervised dense baselines in every reported track, although BM25 remains marginally ahead on TREC-COVID, and it is presented as the first zero-shot dense pipeline to approach supervised quality.[^1] Conceptually, it sits in the lineage of classical query expansion methods such as pseudo-relevance feedback (PRF), but instead of expanding using terms from initially retrieved documents, HyDE expands by generating a full pseudo-document with an LLM before any retrieval has occurred.[^1] A close successor, *Query2doc* by Wang, Yang, and Wei at Microsoft (March 2023, EMNLP 2023), uses few-shot prompting rather than zero-shot, concatenates the generated pseudo-document to the original query rather than averaging embeddings, and reports gains of 3 to 15 percent over BM25 on MS MARCO and TREC DL, with smaller but consistent gains layered on top of supervised dense retrievers.[^8] Other LLM-front-of-retrieval techniques released later in 2023 include multi-query expansion (the LLM rewrites a single query as several paraphrases), step-back questioning, and decomposed sub-query generation; LlamaIndex documents HyDE alongside `StepDecomposeQueryTransform` under the same Query Transformations heading.[^12]

## Implementation

The official implementation lives in the [`texttron/hyde`](https://github.com/texttron/hyde) GitHub repository under the texttron organisation, the same group that maintains the Tevatron retrieval toolkit.[^7] It is written primarily as Jupyter notebooks driving Pyserini for dense indexing and FAISS for nearest-neighbor search, with a small Python package containing two main classes; the repository licence is Apache 2.0 and as of mid-2026 it had roughly 580 stars and 40 forks.[^7]

### Repository layout

The repository ships two end-to-end notebooks. `hyde-demo.ipynb` walks through the full pipeline on a single example query: it loads a prebuilt Contriever FAISS index, prompts InstructGPT for eight hypothetical passages, averages the embeddings, and retrieves the top results from MS MARCO.[^7] `hyde-dl19.ipynb` reproduces the TREC DL19 numbers from the paper using Pyserini's evaluation scripts.[^7] Setup requires installing Pyserini, downloading the Contriever index that the repository links to, and configuring an OpenAI API key through an environment variable.[^7]

### Key classes

The `Promptor` class (in `src/hyde/promptor.py`) stores the eight benchmark-specific prompt templates listed above and formats a query into the full instruction string.[^6] The `Generator` class (in `src/hyde/generator.py`) is an abstract base with two production subclasses: `OpenAIGenerator`, which calls the [OpenAI API](/wiki/openai_api) for InstructGPT or GPT-3 family models, and `CohereGenerator`, which calls [Cohere](/wiki/cohere).[^9] Both accept configuration for model name, API key, number of samples `n` (default 8), `max_tokens` (default 512), `temperature` (default 0.7), `top_p`, frequency and presence penalties, and a `wait_till_success` retry flag for handling rate-limit errors from the underlying API.[^9] Retrieval itself relies on a pre-built Contriever FAISS index that the repository links from external storage, and on a small `HyDE` class that wires `Promptor`, `Generator`, an encoder, and a searcher into a single `e2e_search` call.[^7]
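
The retry behaviour described for `wait_till_success` can be illustrated with a small sketch. Everything here is hypothetical and written for illustration: the `RateLimitError` type, the `_call_api` hook, the `retry_delay` argument, and the `FlakyGenerator` test double are not taken from the repository, and the real classes may be structured differently.

```python
import time
from abc import ABC, abstractmethod


class RateLimitError(Exception):
    """Hypothetical error standing in for an API rate-limit response."""


class Generator(ABC):
    def __init__(self, n: int = 8, max_tokens: int = 512, temperature: float = 0.7,
                 wait_till_success: bool = False):
        self.n = n
        self.max_tokens = max_tokens
        self.temperature = temperature
        self.wait_till_success = wait_till_success

    @abstractmethod
    def _call_api(self, prompt: str) -> list[str]:
        """Return n sampled passages for the prompt."""

    def generate(self, prompt: str, retry_delay: float = 0.0) -> list[str]:
        while True:
            try:
                return self._call_api(prompt)
            except RateLimitError:
                if not self.wait_till_success:
                    raise  # surface the error to the caller
                time.sleep(retry_delay)  # back off, then try again


class FlakyGenerator(Generator):
    """Test double: fails once with a rate-limit error, then succeeds."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.calls = 0

    def _call_api(self, prompt: str) -> list[str]:
        self.calls += 1
        if self.calls == 1:
            raise RateLimitError("slow down")
        return [f"hypothetical passage {i}" for i in range(self.n)]


gen = FlakyGenerator(n=2, wait_till_success=True)
passages = gen.generate("Please write a passage to answer the question.")
print(passages, gen.calls)
```

A production version would normally cap the number of retries and use exponential backoff rather than looping indefinitely.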

### Citation

The repository provides the following BibTeX entry for the arXiv version of the paper:[^7]

```bibtex
@article{hyde,
  title   = {Precise Zero-Shot Dense Retrieval without Relevance Labels},
  author  = {Luyu Gao and Xueguang Ma and Jimmy Lin and Jamie Callan},
  journal = {arXiv preprint arXiv:2212.10496},
  year    = {2022}
}
```

## Integrations

### LangChain

[LangChain](/wiki/langchain) exposes HyDE as `HypotheticalDocumentEmbedder` in the Python package and `HydeRetriever` in the TypeScript package. The Python class wraps any `Embeddings` model and any `BaseLanguageModel`, intercepts the embedding call, generates hypothetical documents with an LLMChain, and returns their averaged embedding to the caller.[^3] A convenience `HypotheticalDocumentEmbedder.from_llm` constructor accepts a prompt key from a built-in set that mirrors the texttron prompts (`web_search`, `sci_fact`, `arguana`, `trec_covid`, `fiqa`, `dbpedia_entity`, `trec_news`, `mr_tydi`), or a fully custom `PromptTemplate`.[^3] Because it conforms to the `Embeddings` interface, the resulting object can be passed directly into any LangChain vector store (for example a Chroma or FAISS store), which makes HyDE a drop-in replacement for a normal query embedder at query time.[^3]
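
The wrapper pattern described here can be shown without any framework dependency. The sketch below is not LangChain's code: `Embedder`, `LengthEmbedder`, and `HydeStyleEmbedder` are hypothetical names, and the toy base embedder simply maps a text to its character and word counts so the arithmetic is easy to follow.

```python
from typing import Callable, Protocol


class Embedder(Protocol):
    def embed_documents(self, texts: list[str]) -> list[list[float]]: ...
    def embed_query(self, text: str) -> list[float]: ...


class LengthEmbedder:
    """Toy base embedder: [character count, word count]."""

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        return [[float(len(t)), float(len(t.split()))] for t in texts]

    def embed_query(self, text: str) -> list[float]:
        return self.embed_documents([text])[0]


class HydeStyleEmbedder:
    """Wraps a base embedder; only the query path goes through the generator."""

    def __init__(self, base: Embedder, generate: Callable[[str], list[str]]):
        self.base = base
        self.generate = generate

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        # Corpus side is unchanged: delegate to the wrapped embedder.
        return self.base.embed_documents(texts)

    def embed_query(self, text: str) -> list[float]:
        # Query side: generate hypothetical documents, embed them, average.
        vectors = self.base.embed_documents(self.generate(text))
        return [sum(col) / len(vectors) for col in zip(*vectors)]


def fake_llm(query: str) -> list[str]:
    # Stand-in generator returning two fixed "hypothetical documents".
    return ["aa bb", "cccc"]


embedder = HydeStyleEmbedder(LengthEmbedder(), fake_llm)
query_vector = embedder.embed_query("q")
print(query_vector)
```

Because the wrapper exposes the same two methods as the base embedder, any component that accepts the base embedder can accept the wrapper unchanged.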

The JavaScript `HydeRetriever` (now part of `@langchain/community` under the `oss/javascript` docs site) accepts a `vectorStore`, an `llm`, a `k` parameter for the number of results, and an optional custom prompt template with a single `{question}` variable; its defaults follow the prompts from the academic paper.[^10] In code, a minimal instantiation looks like `new HydeRetriever({ vectorStore, llm, k: 1 })`.[^10]

### LlamaIndex

[LlamaIndex](/wiki/llamaindex) ships HyDE as `HyDEQueryTransform` under `llama_index.core.indices.query.query_transform`. The transform accepts an `llm`, an optional `hyde_prompt`, and an `include_original` flag; it is composed with a base query engine through `TransformQueryEngine` so that the underlying vector index receives the hypothetical answer as its embedding text instead of the raw query.[^4] A typical usage looks like:

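```python
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine

# Assumes `query_engine` is an existing query engine (for example one built
# with `index.as_query_engine()`) and `query_str` holds the user's question.
hyde = HyDEQueryTransform(include_original=True)  # keep the raw query alongside the hypothetical answer
hyde_query_engine = TransformQueryEngine(query_engine, hyde)  # apply HyDE before retrieval
response = hyde_query_engine.query(query_str)
```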

LlamaIndex documents HyDE alongside `StepDecomposeQueryTransform` in its Query Transformations module and explicitly notes failure modes: open-ended or ambiguous queries can lead HyDE to fabricate misleading "answers" that pull retrieval off topic, and the team recommends inspecting outputs before deploying it on subjective questions.[^4] The framework's own example notebook reports that on a Paul Graham essays corpus, HyDE improved answer quality on factual queries by generating plausible content that lined up with what the essays actually said, while it could mislead on broader interpretive questions.[^4]

### Other ecosystem use

Vector database vendors and tutorial sites have published their own HyDE walkthroughs that reuse the LangChain or LlamaIndex classes against backends such as [Chroma](/wiki/chroma), [Pinecone](/wiki/pinecone), [Weaviate](/wiki/weaviate), [Qdrant](/wiki/qdrant), or [Milvus](/wiki/milvus).[^11] The technique has since been combined with rerankers, [ColBERT](/wiki/colbert) late interaction, and sparse [SPLADE](/wiki/splade) vectors in hybrid retrieval pipelines, where HyDE handles the dense path and BM25 or SPLADE handles the lexical path before a cross-encoder rerank.[^11][^12] Embedding-benchmark efforts such as [MTEB](/wiki/mteb) have helped users decide which dense encoder to combine with HyDE for a given domain, since the technique inherits the underlying encoder's strengths and weaknesses.[^11]

## Applications and Significance

HyDE was an early and unusually clean demonstration that LLM-generated text can substitute for missing supervision signal in [information retrieval](/wiki/information_retrieval). It generalised an old idea (query expansion with pseudo documents) into a regime where the expansion is produced by a general-purpose [instruction-tuned](/wiki/instruction_tuning) model rather than by relevance feedback over a first-pass index.[^1] For practitioners, HyDE became a near-default option to try whenever a vector store underperforms on a new domain, because it requires no labelling and no encoder fine-tuning, and it composes naturally with existing dual encoders and vector indexes.[^11]

### Where HyDE helps in production RAG

In production [retrieval-augmented generation](/wiki/retrieval_augmented_generation) stacks, HyDE is most often used when (a) the corpus is specialised (legal, biomedical, code, financial) and a general-purpose embedding model performs poorly out of the box, (b) the user queries are very short or keyword-like and the corpus passages are long and prose-like, or (c) the team lacks the labelled relevance pairs that would be needed to fine-tune the dense retriever for the domain.[^11] In these settings HyDE often closes most of the gap to a fine-tuned retriever without any labelling or training, at the cost of at least one extra LLM call per query.[^1][^11]

### Broader influence

The paper also helped popularise the broader pattern of "LLM in front of retrieval", which now spans query rewriting, multi-query expansion, [hallucination](/wiki/hallucination)-aware reranking, and step-back questioning. Within months of the HyDE release, Query2doc adapted the approach to few-shot prompting and showed gains on top of BM25,[^8] and Wikipedia-style tutorials, vendor blogs, and follow-on retrieval surveys routinely cite HyDE as the canonical instance of the technique.[^11] HyDE is also referenced in textbook treatments of dense retrieval as the first clean counterexample to the assumption that zero-shot dense retrievers cannot match supervised ones without labelled data.[^1][^11]

### Adoption in libraries and platforms

By mid-2024, HyDE was available as a built-in component in both major Python LLM frameworks (LangChain and LlamaIndex)[^3][^4], in several vector database tutorial paths (Pinecone, Weaviate, Qdrant, Milvus, Chroma, Zilliz)[^11], and in academic toolkits including Pyserini-based notebooks that ship with the reference repository.[^7] Independent academic groups have applied the same generate-then-embed pattern to legal QA, biomedical literature search, code search, and developer support QA, sometimes branding their variant as "Adaptive HyDE" when the system chooses dynamically whether to invoke the LLM at all.[^11]

## Limitations

### Generator-dependent quality

The HyDE authors themselves caution that the technique inherits the failure modes of the underlying LLM. Hypothetical passages may hallucinate plausible but corpus-absent details, and the dense bottleneck only partially filters these out; performance therefore depends on whether the LLM has been exposed to enough domain-relevant text to write a useful pseudo-document.[^1] On niche corpora that fall outside the LLM's pre-training distribution (for example a private codebase, a regulatory archive, or a non-English low-resource domain), the generated passages may be generic platitudes that fail to discriminate between candidate documents in the corpus.[^1]

### Ambiguous and open-ended queries

LlamaIndex's documentation flags two practical failure cases: queries that are ambiguous without context (the generated passage drifts away from the user intent, taking the embedding with it) and open-ended subjective questions (the LLM may inject bias that warps retrieval).[^4] For example, on a query like "What are the best machine learning algorithms?" the generator may write a passage anchored on a particular family of methods (decision trees, neural networks) that excludes documents about the alternatives, narrowing rather than broadening retrieval.[^4]

### How fast is HyDE, and what does it cost?

Latency and cost are also issues. Each retrieval call now requires at least one LLM generation, which is far slower and more expensive than embedding a short query directly; sampling N = 8 hypothetical documents, as in the paper, multiplies the cost further.[^1] In a production setting where a normal query embedding might take a few tens of milliseconds, a single HyDE call can take a second or more even with a fast hosted model, plus the per-token cost of generation.[^1] These overheads have led some production systems to apply HyDE only on queries that the system judges retrieval-difficult, to use cheaper generators (such as a small open-weights model) for the generation step, or to cache hypothetical documents for common queries.[^11]
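
Caching is the simplest of these mitigations to illustrate. The sketch below memoises the generation step with `functools.lru_cache`; `generate_hypothetical` is a hypothetical stand-in for the billed LLM call, and the normalisation rule (lower-casing and collapsing whitespace) is an assumption chosen for the example.

```python
from functools import lru_cache

stats = {"llm_calls": 0}  # counts how often the expensive step actually runs


def generate_hypothetical(query: str) -> str:
    """Hypothetical stand-in for the slow, billed LLM call."""
    stats["llm_calls"] += 1
    return f"A passage that answers: {query}"


@lru_cache(maxsize=1024)
def _cached(normalised_query: str) -> str:
    return generate_hypothetical(normalised_query)


def hypothetical_for(query: str) -> str:
    # Normalise case and whitespace so trivially different spellings share an entry.
    return _cached(" ".join(query.lower().split()))


first = hypothetical_for("What is HyDE?")
second = hypothetical_for("  what is   hyde? ")
print(stats["llm_calls"])
```

A cache of this kind trades sampling diversity for latency, since repeated queries always reuse the same hypothetical document.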

### Knowledge leakage critique

A more fundamental critique appears in Yoon, Jung, Yoon, and Park's 2025 paper *Hypothetical Documents or Knowledge Leakage? Rethinking LLM-based Query Expansion*. Across three fact-verification benchmarks (FEVER, SciFact, and AVeriTeC) and seven LLMs, the authors show that HyDE and Query2doc gains correlate strongly with whether the LLM's generated text contains sentences entailed by the gold evidence: in most settings over 40 percent of generated documents matched gold evidence, peaking at 83.5 percent on FEVER with GPT-4o-mini, and the techniques fell below the no-expansion baseline on claims whose answers were not in the model's training distribution.[^13] They report that "performance improvements consistently occurred for claims whose generated documents included sentences entailed by gold evidence," and argue that some reported gains may reflect *knowledge leakage* from pre-training rather than improved query-document alignment.[^13] This implies that HyDE may be most valuable when the generator's parametric knowledge overlaps with the corpus and least valuable on the very out-of-distribution problems that motivate zero-shot retrieval in the first place.[^13]

### Tighter coupling between generator and encoder

A practical engineering downside is that HyDE introduces a runtime coupling between the LLM and the dense encoder. Changing the generator (for example moving from InstructGPT to a newer model with a different writing style) changes the distribution of hypothetical documents and can shift which corpus passages end up in the top-k. Teams that adopt HyDE typically need to re-evaluate the pipeline whenever the generator is upgraded, in addition to the usual re-evaluation when the embedding model is upgraded.[^11]

## Follow-up Work

### Query2doc

Liang Wang, Nan Yang, and Furu Wei at Microsoft Research released *Query2doc: Query Expansion with Large Language Models* on arXiv on 14 March 2023 and published it at EMNLP 2023.[^8] Query2doc differs from HyDE on three axes. First, it uses few-shot rather than zero-shot prompting; the prompt includes several real query-document pairs as in-context examples before asking the LLM to write a pseudo-document for the new query.[^8] Second, it concatenates the generated pseudo-document to the original query string rather than averaging embeddings; the concatenated string is then fed to a normal retriever, which makes the method usable for sparse BM25 as well as dense indexes.[^8] Third, the reported gains are layered on supervised retrievers in addition to BM25, with the authors reporting a 3 to 15 percent improvement over BM25 alone on MS MARCO and TREC DL.[^8]
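
The second difference, string concatenation versus embedding averaging, is easy to see side by side. In this sketch both helper names are hypothetical, and the `repeat` parameter (repeating the original query to up-weight its terms for a lexical retriever) is an illustrative assumption rather than a setting taken from the text above.

```python
def query2doc_query(query: str, pseudo_doc: str, repeat: int = 1) -> str:
    """Query2doc style: expand in text space, then hand the string to any retriever."""
    return " ".join([query] * repeat + [pseudo_doc])


def hyde_search_vector(query_vec: list[float], doc_vecs: list[list[float]]) -> list[float]:
    """HyDE style: expand in embedding space by averaging document and query vectors."""
    vectors = doc_vecs + [query_vec]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


expanded = query2doc_query("hyde retrieval", "HyDE embeds a generated passage.", repeat=2)
averaged = hyde_search_vector([1.0, 0.0], [[0.0, 1.0], [0.0, 1.0]])
print(expanded)
print(averaged)
```

Because the Query2doc output is still a plain string, it can be passed to BM25 unchanged, whereas the HyDE output is a vector that only a dense index can consume.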

### Hypothetical Documents or Knowledge Leakage?

Yoon, Jung, Yoon, and Park's 2025 critique paper, discussed above, is a prominent follow-up evaluation of HyDE-style methods. It frames HyDE and Query2doc not as orthogonal retrieval techniques but as proxies for memorisation, and recommends that future evaluations measure overlap with pre-training data when claiming gains from LLM-based query expansion.[^13]

### Adaptive and decomposed variants

Subsequent work has explored *adaptive* HyDE pipelines that invoke the generator only on queries the system flags as retrieval-difficult, with the goal of recovering the quality wins on hard queries while avoiding the cost on easy ones.[^11] LlamaIndex has separately added query decomposition transforms that, like HyDE, manipulate the query before retrieval; the two are often combined.[^12] Application-domain papers have applied HyDE-style retrieval to developer support QA, biomedical literature search, and tutoring systems, in each case noting that the generated passage's quality dominates retrieval quality.[^11]
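
One way such a gate could work is sketched below. The difficulty signal used here, the top similarity score of a plain first-pass search compared with a fixed threshold, is an assumption made for illustration; the text above does not specify how adaptive systems judge difficulty, and `adaptive_search`, `min_score`, and both stub searchers are hypothetical.

```python
from typing import Callable

Hit = tuple[float, str]  # (similarity score, passage)
Searcher = Callable[[str], list[Hit]]


def adaptive_search(query: str, plain_search: Searcher, hyde_search: Searcher,
                    min_score: float = 0.5) -> tuple[list[Hit], bool]:
    """Return (hits, used_hyde). HyDE runs only when the plain search looks weak."""
    hits = plain_search(query)
    if hits and hits[0][0] >= min_score:
        return hits, False  # confident first pass: skip the LLM call
    return hyde_search(query), True  # weak first pass: pay for HyDE


def plain_stub(query: str) -> list[Hit]:
    # Pretend the plain encoder only handles queries that mention "refund" well.
    score = 0.82 if "refund" in query else 0.21
    return [(score, "passage found by plain search")]


def hyde_stub(query: str) -> list[Hit]:
    return [(0.74, "passage found via hypothetical document")]


easy_hits, easy_used_hyde = adaptive_search("refund policy", plain_stub, hyde_stub)
hard_hits, hard_used_hyde = adaptive_search("why was I charged twice", plain_stub, hyde_stub)
print(easy_used_hyde, hard_used_hyde)
```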

## How does HyDE compare to other retrievers?

| Method | Year | Mechanism | Supervision | Typical pairing |
|---|---|---|---|---|
| BM25 | 1994 (Robertson) | Lexical scoring | None | Sparse index |
| [DPR](/wiki/dense_passage_retrieval) | 2020 | Dual encoder, in-domain training | Supervised | FAISS |
| Contriever | 2022 | Unsupervised contrastive dual encoder | Self-supervised | FAISS |
| HyDE | 2022 | LLM hypothetical doc + unsupervised dense encoder | Zero-shot | Contriever + FAISS |
| Query2doc | 2023 | LLM pseudo-doc + concatenation with query | Few-shot | BM25 or dense retriever |
| [ColBERT](/wiki/colbert) late interaction | 2020 / v2 2021 | Token-level MaxSim | Supervised | ColBERT index |
| [SPLADE](/wiki/splade) | 2021 | Learned sparse expansion | Supervised | Inverted index |

HyDE is best understood as orthogonal to, rather than competing with, most of the entries in this table: it modifies only the query path and can be layered on top of any dense retriever or even on BM25 (by using the generated pseudo-document as the search string), and it composes with reranking, hybrid sparse-plus-dense pipelines, and per-domain fine-tuning.[^1][^11]

## See also

- [Retrieval-Augmented Generation (RAG)](/wiki/retrieval_augmented_generation)
- [Dense Passage Retrieval (DPR)](/wiki/dense_passage_retrieval)
- [ColBERT](/wiki/colbert)
- [SPLADE](/wiki/splade)
- [BM25](/wiki/bm25)
- [Information Retrieval](/wiki/information_retrieval)
- [Embeddings](/wiki/embeddings)
- [Cosine similarity](/wiki/cosine_similarity)
- [FAISS](/wiki/faiss)
- [MTEB (Massive Text Embedding Benchmark)](/wiki/mteb)
- [LangChain](/wiki/langchain)
- [LlamaIndex](/wiki/llamaindex)
- [Chroma](/wiki/chroma)
- [Pinecone](/wiki/pinecone)
- [Weaviate](/wiki/weaviate)
- [Qdrant](/wiki/qdrant)
- [Milvus](/wiki/milvus)
- [InstructGPT](/wiki/instructgpt)
- [Zero-shot, one-shot and few-shot learning](/wiki/zero_shot_one_shot_and_few_shot_learning)
- [Hallucination](/wiki/hallucination)
- [Carnegie Mellon University](/wiki/cmu)

## References

[^1]: Luyu Gao, Xueguang Ma, Jimmy Lin, Jamie Callan, "Precise Zero-Shot Dense Retrieval without Relevance Labels", arXiv:2212.10496, 2022-12-20. https://arxiv.org/abs/2212.10496. Accessed 2026-06-24.
[^2]: Luyu Gao, Xueguang Ma, Jimmy Lin, Jamie Callan, "Precise Zero-Shot Dense Retrieval without Relevance Labels", Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Long Papers), ACL Anthology 2023.acl-long.99, pp. 1762-1777, 2023-07, Toronto, Canada. https://aclanthology.org/2023.acl-long.99/. Accessed 2026-06-24.
[^3]: LangChain, "HypotheticalDocumentEmbedder API reference", LangChain Python documentation, 2024. https://api.python.langchain.com/en/latest/chains/langchain.chains.hyde.base.HypotheticalDocumentEmbedder.html. Accessed 2026-05-21.
[^4]: LlamaIndex, "HyDE Query Transform Demo", LlamaIndex developer documentation, 2024. https://developers.llamaindex.ai/python/examples/query_transformations/hydequerytransformdemo/. Accessed 2026-05-21.
[^5]: Luyu Gao, "About", Personal website, Language Technologies Institute, Carnegie Mellon University, 2024. https://luyug.github.io/. Accessed 2026-05-21.
[^6]: texttron, "promptor.py", `texttron/hyde` GitHub repository, 2022. https://github.com/texttron/hyde/blob/main/src/hyde/promptor.py. Accessed 2026-05-21.
[^7]: texttron, "HyDE: Precise Zero-Shot Dense Retrieval without Relevance Labels", `texttron/hyde` GitHub repository README, 2022. https://github.com/texttron/hyde. Accessed 2026-06-24.
[^8]: Liang Wang, Nan Yang, Furu Wei, "Query2doc: Query Expansion with Large Language Models", arXiv:2303.07678, 2023-03-14 (EMNLP 2023). https://arxiv.org/abs/2303.07678. Accessed 2026-06-24.
[^9]: texttron, "generator.py", `texttron/hyde` GitHub repository, 2022. https://github.com/texttron/hyde/blob/main/src/hyde/generator.py. Accessed 2026-05-21.
[^10]: LangChain, "Hyde integration (JavaScript)", LangChain documentation, 2024. https://docs.langchain.com/oss/javascript/integrations/retrievers/hyde. Accessed 2026-05-21.
[^11]: Zilliz, "Better RAG with HyDE: Hypothetical Document Embeddings", Zilliz Learn, 2024. https://zilliz.com/learn/improve-rag-and-information-retrieval-with-hyde-hypothetical-document-embeddings. Accessed 2026-05-21.
[^12]: LlamaIndex, "Query Transformations", LlamaIndex developer documentation, 2024. https://developers.llamaindex.ai/python/framework/optimizing/advanced_retrieval/query_transformations/. Accessed 2026-05-21.
[^13]: Yejun Yoon, Jaeyoon Jung, Seunghyun Yoon, Kunwoo Park, "Hypothetical Documents or Knowledge Leakage? Rethinking LLM-based Query Expansion", arXiv:2504.14175, 2025-04-19. https://arxiv.org/abs/2504.14175. Accessed 2026-06-24.
[^14]: Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, Edouard Grave, "Unsupervised Dense Information Retrieval with Contrastive Learning", arXiv:2112.09118, Transactions on Machine Learning Research, 2022. https://arxiv.org/abs/2112.09118. Accessed 2026-06-24.

