HyDE (Hypothetical Document Embeddings)
Last reviewed
Jun 7, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 4,413 words
HyDE (Hypothetical Document Embeddings) is a zero-shot dense retrieval technique introduced in December 2022 by Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan in the paper Precise Zero-Shot Dense Retrieval without Relevance Labels.[1] Rather than embedding a user query directly into a vector space and searching for similar documents, HyDE first prompts an instruction-following large language model to write a hypothetical answer document for the query, then embeds that synthetic document with an unsupervised encoder such as Contriever and uses its vector for nearest-neighbor search in the corpus.[1] The method requires no relevance labels and no fine-tuned dual encoder, yet on TREC Deep Learning, BEIR, and Mr.TyDi benchmarks it outperformed the strongest unsupervised baseline (Contriever) and approached the quality of supervised, in-domain fine-tuned retrievers.[1][2] HyDE became one of the most widely adopted query-side enhancements for retrieval-augmented generation systems, with first-party integrations in LangChain and LlamaIndex.[3][4]
Dense retrievers based on dual encoders, such as Dense Passage Retrieval (DPR), encode queries and passages into a shared vector space and retrieve passages by cosine or inner-product similarity.[1] These models typically need substantial training on labelled query-document pairs (for example MS MARCO) to learn that space well, because the two encoders must agree on which surface forms of a question map to which surface forms of an answering passage.[1] When the same encoder is moved to a new domain without further supervision, retrieval quality drops sharply, a weakness documented in the BEIR benchmark of Thakur et al., which collected 18 datasets across nine heterogeneous retrieval tasks, including fact verification, scientific question answering, financial QA, and argument retrieval.[1] On many BEIR tasks, supervised dense retrievers actually underperformed lexical BM25, a result that motivated a wave of follow-up work on better zero-shot encoders.[1]
The two main classes of pre-HyDE responses were better self-supervised encoders and synthetic-data approaches. Contriever, introduced by Izacard et al. at Meta in 2022, trained an encoder purely with contrastive objectives over unlabelled corpora using random cropping to build positive pairs.[1] It improved zero-shot performance but still lagged fine-tuned retrievers on out-of-domain tasks. In parallel, methods such as InPars from Bonifacio et al. and Promptagator from Dai et al. used large language models to generate synthetic queries from real corpus documents, then fine-tuned a small retriever on those synthetic pairs; this trades inference-time LLM calls for offline training cost but requires per-corpus fine-tuning.[1]
The HyDE authors framed zero-shot retrieval as fundamentally hard because nothing in the encoder's training tells it which corpus passage answers a given query intent: queries and answers occupy different distributions, and an unsupervised encoder cannot bridge them.[1] Their proposal inverted the InPars/Promptagator direction. Rather than generating queries from documents and re-training the retriever, HyDE generates documents from queries at inference time and reuses an off-the-shelf unsupervised encoder unchanged.[1] The hard work of mapping a question to an answer text is delegated to a generic instruction-following language model such as InstructGPT, which has been trained to follow natural language instructions on broad web data, while the dense encoder is given the easier job of finding real passages near a generated text in embedding space.[1]
The paper was first posted to arXiv on 20 December 2022 with arXiv identifier 2212.10496[1] and later published in the Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023, Volume 1: Long Papers) in Toronto, pages 1762 to 1777.[2] Lead author Luyu Gao was a PhD student at Carnegie Mellon University's Language Technologies Institute under Jamie Callan and previously studied computer science at the University of Illinois at Urbana-Champaign; he has worked on retrieval, transformer pre-training, and program-aided reasoning, and was also lead author of the PAL paper on program-aided language models.[5] Co-authors Xueguang Ma and Jimmy Lin were at the University of Waterloo's David R. Cheriton School of Computer Science; Lin is the principal investigator behind the Pyserini toolkit that HyDE's reference implementation uses for evaluation.[5][7] Jamie Callan is a long-time information retrieval professor at Carnegie Mellon University's Language Technologies Institute.[5]
HyDE replaces the single encoding step of a dense retriever with a two-stage pipeline. The query path becomes "query, then LLM, then encoder, then index"; the corpus path is unchanged from a normal dense retriever.[1]
In the generation stage, an instruction-following LLM (in the paper, InstructGPT text-davinci-003) is called with a prompt such as "Please write a passage to answer the question.\n\nQuestion: {query}\n\nPassage:".[6]
The hypothetical document expresses the relevance pattern the user implicitly cares about. Although the generation may hallucinate specifics, the encoder's low-dimensional dense bottleneck filters out incorrect tokens and projects the output near the true answer's neighborhood in embedding space.[1] HyDE thus exploits the LLM's world knowledge for query understanding while delegating final grounding to the corpus index, which prevents the system from returning fabricated content directly to the user; only real corpus passages are ever retrieved.[1]
The technique also recasts the query, which is often a short keyword string or a question fragment, into the genre, vocabulary, and length distribution of the target corpus. This brings the search vector closer to the manifold occupied by the actual passages and avoids the well-known query-document length and style mismatch that hurts naive cosine similarity search with general-purpose encoders.[1] The authors note that, in this sense, HyDE generalises the classical query expansion strategy of pseudo-relevance feedback: instead of expanding using terms drawn from a noisy first-pass retrieval, it expands using a passage drawn from the LLM's parametric memory.[1]
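The two-stage query path described above can be sketched end to end in a few lines. This is a minimal illustration, not the reference implementation: `toy_generate` stands in for the LLM call and `toy_encode` (a hashed bag-of-words) stands in for Contriever, and both names, along with `hyde_search`, are hypothetical. Only the overall shape (generate several passages, embed each, average, search) follows the description in this article.

```python
import hashlib

import numpy as np

DIM = 1 << 16  # large hash space so toy token collisions are unlikely


def toy_encode(text: str) -> np.ndarray:
    """Stand-in encoder (NOT Contriever): hashed bag-of-words, L2-normalised."""
    vec = np.zeros(DIM)
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def toy_generate(query: str, n: int) -> list[str]:
    """Stand-in for the LLM call: returns n canned 'hypothetical' passages."""
    return [f"{query} hypothetical passage variant {i}" for i in range(n)]


def hyde_search(query, corpus, generate=toy_generate, encode=toy_encode, n=8, k=3):
    """Generate n hypothetical documents, average their embeddings, then search."""
    hypothetical_docs = generate(query, n)
    # Embed each generated passage and average into a single search vector.
    search_vec = np.stack([encode(d) for d in hypothetical_docs]).mean(axis=0)
    # The corpus path is unchanged: real passages are embedded as usual.
    corpus_vecs = np.stack([encode(p) for p in corpus])
    scores = corpus_vecs @ search_vec  # inner-product similarity
    top = np.argsort(-scores)[:k]
    return [(corpus[i], float(scores[i])) for i in top]
```

Because only real corpus passages are returned, a hallucinated detail in a generated passage can shift the ranking but cannot itself appear in the results.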
The reference implementation exposes a small number of knobs: the prompt template, the generator model, the sampling temperature (set to 0.7 in the published experiments), the number of samples n (set to 8), and the maximum number of tokens per sample (set to 512).[9] Lowering n reduces cost and latency at some cost in retrieval quality; raising temperature increases diversity across samples but also the risk of off-topic generations.[9]
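To make the cost side of these knobs concrete, the sketch below bundles the three sampling settings into a small config object and computes the worst-case number of generated tokens per query. The class and method names are hypothetical; only the default values (n = 8, temperature 0.7, 512 tokens) come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HydeSamplingConfig:
    """Hypothetical container for the sampling knobs described above."""

    n: int = 8                # hypothetical documents sampled per query
    temperature: float = 0.7  # sampling temperature
    max_tokens: int = 512     # cap on tokens per generated document

    def max_generated_tokens(self) -> int:
        """Upper bound on generated tokens for a single query."""
        return self.n * self.max_tokens


default = HydeSamplingConfig()
cheaper = HydeSamplingConfig(n=2)
print(default.max_generated_tokens())  # 4096
print(cheaper.max_generated_tokens())  # 1024
```

The bound is a ceiling rather than an expectation, since generated passages often stop well short of the token cap.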
The reference repository ships eight task-specific prompts, one per benchmark domain. Each prompt instructs the LLM to write a passage in the appropriate genre.[6]
| Task key | Prompt skeleton |
|---|---|
| web_search | "Please write a passage to answer the question." |
| scifact | "Please write a scientific paper passage to support/refute the claim." |
| arguana | "Please write a counter argument for the passage." |
| trec_covid | "Please write a scientific paper passage to answer the question." |
| fiqa | "Please write a financial article passage to answer the question." |
| dbpedia_entity | "Please write a passage to answer the question." |
| trec_news | "Please write a news passage about the topic." |
| mr_tydi | "Please write a passage in {language} to answer the question in detail." |
The mr_tydi template additionally accepts a language name, which lets HyDE generate hypothetical passages in the target language before encoding them with a multilingual encoder.[6]
Gao and colleagues evaluated HyDE on three families of benchmarks: TREC Deep Learning web search (DL19, DL20), BEIR low-resource retrieval, and Mr.TyDi multilingual retrieval. All numbers below are taken from the paper's tables and reflect HyDE with InstructGPT plus Contriever (mContriever for Mr.TyDi).[1]
On the TREC Deep Learning passage tracks built on MS MARCO, HyDE reached nDCG@10 of 61.3 on DL19 and 57.9 on DL20, compared with Contriever (44.5 / 42.1) and lexical BM25 (50.6 / 48.0). The fine-tuned supervised baseline ContrieverFT scored 62.1 / 63.2, so HyDE essentially closed the gap on DL19 without any labels and remained competitive on DL20.[1]
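One way to read "closed the gap" is as the fraction of the distance between zero-shot Contriever and fine-tuned ContrieverFT that HyDE recovers. The helper below is only a convenience for that arithmetic (the function name is hypothetical); the scores are the nDCG@10 values quoted above.

```python
def gap_closed(zero_shot: float, hyde: float, supervised: float) -> float:
    """Fraction of the zero-shot-to-supervised gap recovered by HyDE."""
    return (hyde - zero_shot) / (supervised - zero_shot)


dl19 = gap_closed(zero_shot=44.5, hyde=61.3, supervised=62.1)
dl20 = gap_closed(zero_shot=42.1, hyde=57.9, supervised=63.2)
print(f"DL19: {dl19:.1%}")  # DL19: 95.5%
print(f"DL20: {dl20:.1%}")  # DL20: 74.9%
```

On this reading HyDE recovers roughly 95 percent of the gap on DL19 and about three quarters of it on DL20.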
| Dataset | HyDE | Contriever | BM25 |
|---|---|---|---|
| SciFact | 69.1 | 64.9 | 67.9 |
| ArguAna | 46.6 | 37.9 | 39.7 |
| TREC-COVID | 59.3 | 27.3 | 59.5 |
| FiQA | 27.3 | 24.5 | 23.6 |
| DBPedia | 36.8 | 29.2 | 31.8 |
| TREC-NEWS | 44.0 | 34.8 | 39.5 |
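Across the six BEIR datasets in the table (scores are nDCG@10), HyDE improved on the underlying Contriever encoder every time, most sharply on TREC-COVID (59.3 vs 27.3). It beat BM25 on five of the six and trailed it by 0.2 points on TREC-COVID (59.3 vs 59.5).[1]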
| Language | HyDE | mContriever | mContrieverFT |
|---|---|---|---|
| Swahili | 41.7 | 38.3 | 51.2 |
| Korean | 30.6 | 22.3 | 34.2 |
| Japanese | 30.7 | 19.5 | 32.4 |
| Bengali | 41.3 | 35.3 | 42.3 |
On Mr.TyDi, HyDE beat the zero-shot mContriever on every language tested and came within roughly one to four points of the fully fine-tuned mContrieverFT on Korean, Japanese, and Bengali; Swahili was the exception, with a gap of 9.5 points (41.7 vs 51.2). These results were obtained even though the prompts to InstructGPT were issued in English with a language-name slot.[1]
The paper additionally reports an ablation over the generator. Replacing InstructGPT with smaller or weaker generators (an unaligned GPT-3 base, or smaller instruction-tuned variants) degraded performance in line with how well each model followed the instruction to write a passage of the requested form.[1] This is consistent with the conceptual claim that HyDE's quality is bounded by the generator's ability to produce a passage that looks the way a real answer passage would, even when individual details in it are factually wrong.[1]
The authors position HyDE against three families of baselines: lexical BM25, unsupervised dense retrievers (Contriever, mContriever), and supervised fine-tuned dense retrievers.[1] HyDE is the strongest unsupervised dense method in every reported track and is the first zero-shot dense pipeline to approach supervised quality.[1] Conceptually, it sits in the lineage of classical query expansion methods such as pseudo-relevance feedback (PRF), but instead of expanding using terms from initially retrieved documents, HyDE expands by generating a full pseudo-document with an LLM before any retrieval has occurred.[1] A close successor, Query2doc by Wang, Yang, and Wei at Microsoft (March 2023, EMNLP 2023), uses few-shot prompting rather than zero-shot, concatenates the generated pseudo-document to the original query rather than averaging embeddings, and reports 3 to 15 percent BM25 gains on MS MARCO and TREC DL, with smaller but consistent gains layered on top of supervised dense retrievers.[8] Other LLM-front-of-retrieval techniques released later in 2023 include multi-query expansion (the LLM rewrites a single query as several paraphrases), step-back questioning, and decomposed sub-query generation; LlamaIndex documents HyDE alongside StepDecomposeQueryTransform under the same Query Transformations heading.[12]
The official implementation lives in the texttron/hyde GitHub repository under the texttron organisation, the same group that maintains the Tevatron retrieval toolkit.[7] It is written primarily as Jupyter notebooks driving Pyserini for dense indexing and FAISS for nearest-neighbor search, with a small Python package containing two main classes; the repository licence is Apache 2.0, and as of mid-2024 it had 579 stars and 40 forks.[7]
The repository ships two end-to-end notebooks. hyde-demo.ipynb walks through the full pipeline on a single example query: it loads a prebuilt Contriever FAISS index, prompts InstructGPT for eight hypothetical passages, averages the embeddings, and retrieves the top results from MS MARCO.[7] hyde-dl19.ipynb reproduces the TREC DL19 numbers from the paper using Pyserini's evaluation scripts.[7] Setup requires installing Pyserini, downloading the Contriever index that the repository links to, and configuring an OpenAI API key through an environment variable.[7]
The Promptor class (in src/hyde/promptor.py) stores the eight benchmark-specific prompt templates listed above and formats a query into the full instruction string.[6] The Generator class (in src/hyde/generator.py) is an abstract base with two production subclasses: OpenAIGenerator, which calls the OpenAI API for InstructGPT or GPT-3 family models, and CohereGenerator, which calls Cohere.[9] Both accept configuration for model name, API key, number of samples n (default 8), max_tokens (default 512), temperature (default 0.7), top_p, frequency and presence penalties, and a wait_till_success retry flag for handling rate-limit errors from the underlying API.[9] Retrieval itself relies on a pre-built Contriever FAISS index that the repository links from external storage, and on a small HyDE class that wires Promptor, Generator, an encoder, and a searcher into a single e2e_search call.[7]
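The effect of a `wait_till_success`-style flag can be illustrated with a small retry loop. This is a sketch of the general pattern only, not the repository's code: `RateLimitError`, `generate_with_retry`, and the `delay_seconds` and `sleep` parameters are hypothetical names, and the real generators may handle errors differently.

```python
import time
from typing import Callable


class RateLimitError(Exception):
    """Hypothetical stand-in for an API client's rate-limit exception."""


def generate_with_retry(
    call: Callable[[], str],
    wait_till_success: bool = False,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run `call`; on a rate-limit error either re-raise or wait and retry."""
    while True:
        try:
            return call()
        except RateLimitError:
            if not wait_till_success:
                raise  # fail fast when the flag is off
            sleep(delay_seconds)  # back off, then try the same request again
```

With the flag on, the loop retries indefinitely, so a production variant would likely also want a cap on attempts or total wait time.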
The repository provides the following BibTeX entry for the arXiv version of the paper:[7]
@article{hyde,
title = {Precise Zero-Shot Dense Retrieval without Relevance Labels},
author = {Luyu Gao and Xueguang Ma and Jimmy Lin and Jamie Callan},
journal = {arXiv preprint arXiv:2212.10496},
year = {2022}
}
LangChain exposes HyDE as HypotheticalDocumentEmbedder in the Python package and HydeRetriever in the TypeScript package. The Python class wraps any Embeddings model and any BaseLanguageModel, intercepts the embedding call, generates hypothetical documents with an LLMChain, and returns their averaged embedding to the caller.[3] A convenience HypotheticalDocumentEmbedder.from_llm constructor accepts a prompt key from a built-in set that mirrors the texttron prompts (web_search, sci_fact, arguana, trec_covid, fiqa, dbpedia_entity, trec_news, mr_tydi), or a fully custom PromptTemplate.[3] Because it conforms to the Embeddings interface, the resulting object can be passed directly into any LangChain vector store (for example a Chroma or FAISS store), which makes HyDE a drop-in replacement for a normal query embedder at indexing or query time.[3]
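The "intercept the embedding call" idea can be shown without LangChain itself. The class below is a library-free sketch of the pattern, not LangChain's implementation: `HydeEmbedderSketch` and its constructor arguments are hypothetical, and passing corpus texts straight through to the base embedder is an assumption made here for illustration.

```python
from typing import Callable

Vector = list[float]


class HydeEmbedderSketch:
    """Wraps a base embedder so that queries are embedded via generated passages."""

    def __init__(
        self,
        embed_text: Callable[[str], Vector],
        generate: Callable[[str, int], list[str]],
        n: int = 4,
    ) -> None:
        self.embed_text = embed_text  # base embedding function
        self.generate = generate      # returns n hypothetical passages for a query
        self.n = n

    def embed_documents(self, texts: list[str]) -> list[Vector]:
        # Assumption: corpus texts pass through to the base embedder unchanged.
        return [self.embed_text(t) for t in texts]

    def embed_query(self, query: str) -> Vector:
        # Generate hypothetical passages, embed each, and average per dimension.
        vectors = [self.embed_text(doc) for doc in self.generate(query, self.n)]
        dim = len(vectors[0])
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```

Because the object exposes the same two methods as an ordinary embedder, a vector store that only calls `embed_query` and `embed_documents` would not need to know HyDE is involved.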
The JavaScript HydeRetriever (now part of @langchain/community under the oss/javascript docs site) accepts a vectorStore, an llm, a k parameter for the number of results, and an optional custom prompt template with a single {question} variable; its defaults follow the prompts from the academic paper.[10] In code, a minimal instantiation looks like new HydeRetriever({ vectorStore, llm, k: 1 }).[10]
LlamaIndex ships HyDE as HyDEQueryTransform under llama_index.core.indices.query.query_transform. The transform accepts an llm, an optional hyde_prompt, and an include_original flag; it is composed with a base query engine through TransformQueryEngine so that the underlying vector index receives the hypothetical answer as its embedding text instead of the raw query.[4] A typical usage, assuming an existing query_engine built over a vector index and a query string query_str, looks like:
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine
hyde = HyDEQueryTransform(include_original=True)
hyde_query_engine = TransformQueryEngine(query_engine, hyde)
response = hyde_query_engine.query(query_str)
LlamaIndex documents HyDE alongside StepDecomposeQueryTransform in its Query Transformations module and explicitly notes failure modes: open-ended or ambiguous queries can lead HyDE to fabricate misleading "answers" that pull retrieval off topic, and the team recommends inspecting outputs before deploying it on subjective questions.[4] The framework's own example notebook reports that on a Paul Graham essays corpus, HyDE improved answer quality on factual queries by generating plausible content that lined up with what the essays actually said, while it could mislead on broader interpretive questions.[4]
Vector database vendors and tutorial sites have published their own HyDE walkthroughs that reuse the LangChain or LlamaIndex classes against backends such as Chroma, Pinecone, Weaviate, Qdrant, or Milvus.[11] The technique has since been combined with rerankers, ColBERT late interaction, and sparse SPLADE vectors in hybrid retrieval pipelines, where HyDE handles the dense path and BM25 or SPLADE handles the lexical path before a cross-encoder rerank.[11][12] Embedding-benchmark efforts such as MTEB have helped users decide which dense encoder to combine with HyDE for a given domain, since the technique inherits the underlying encoder's strengths and weaknesses.[11]
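The text above does not say how the dense (HyDE) and lexical result lists are merged before reranking. One common choice, swapped in here purely as an assumption, is reciprocal rank fusion (RRF), which scores each document by summing 1 / (k + rank) over the lists it appears in. The function name and the conventional constant k = 60 are not taken from any source cited in this article.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document ids using reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["d1", "d2", "d3"]    # e.g. from the HyDE dense path
lexical_hits = ["d2", "d4", "d1"]  # e.g. from BM25 or SPLADE
print(reciprocal_rank_fusion([dense_hits, lexical_hits]))
# ['d2', 'd1', 'd4', 'd3']
```

The fused list would then be passed to the cross-encoder reranker mentioned above.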
HyDE was an early and unusually clean demonstration that LLM-generated text can substitute for missing supervision signal in information retrieval. It generalised an old idea (query expansion with pseudo documents) into a regime where the expansion is produced by a general-purpose instruction-tuned model rather than by relevance feedback over a first-pass index.[1] For practitioners, HyDE became a near-default option to try whenever a vector store underperforms on a new domain, because it requires no labelling and no encoder fine-tuning, and it composes naturally with existing dual encoders and vector indexes.[11]
In production retrieval-augmented generation stacks, HyDE is most often used when (a) the corpus is specialised (legal, biomedical, code, financial) and a general-purpose embedding model performs poorly out of the box, (b) the user queries are very short or keyword-like and the corpus passages are long and prose-like, or (c) the team lacks the labelled relevance pairs that would be needed to fine-tune the dense retriever for the domain.[11] In these settings HyDE often closes most of the gap to a fine-tuned retriever for free, at the cost of one extra LLM call per query.[1][11]
The paper also helped popularise the broader pattern of "LLM in front of retrieval", which now spans query rewriting, multi-query expansion, hallucination-aware reranking, and step-back questioning. Within months of the HyDE release, Query2doc adapted the approach to few-shot prompting and showed gains on top of BM25,[8] and Wikipedia-style tutorials, vendor blogs, and follow-on retrieval surveys routinely cite HyDE as the canonical instance of the technique.[11] HyDE is also referenced in textbook treatments of dense retrieval as the first clean counterexample to the assumption that zero-shot dense retrievers cannot match supervised ones without labelled data.[1][11]
By mid-2024, HyDE was available as a built-in component in both major Python LLM frameworks (LangChain and LlamaIndex)[3][4], in several vector database tutorial paths (Pinecone, Weaviate, Qdrant, Milvus, Chroma, Zilliz)[11], and in academic toolkits including Pyserini-based notebooks that ship with the reference repository.[7] Independent academic groups have applied the same generate-then-embed pattern to legal QA, biomedical literature search, code search, and developer support QA, sometimes branding their variant as "Adaptive HyDE" when the system chooses dynamically whether to invoke the LLM at all.[11]
The HyDE authors themselves caution that the technique inherits the failure modes of the underlying LLM. Hypothetical passages may hallucinate plausible but corpus-absent details, and the dense bottleneck only partially filters these out; performance therefore depends on whether the LLM has been exposed to enough domain-relevant text to write a useful pseudo-document.[1] On niche corpora that fall outside the LLM's pre-training distribution (for example a private codebase, a regulatory archive, or a non-English low-resource domain), the generated passages may be generic platitudes that fail to discriminate between candidate documents in the corpus.[1]
LlamaIndex's documentation flags two practical failure cases: queries that are ambiguous without context (the generated passage drifts away from the user intent, taking the embedding with it) and open-ended subjective questions (the LLM may inject bias that warps retrieval).[4] For example, on a query like "What are the best machine learning algorithms?" the generator may write a passage anchored on a particular family of methods (decision trees, neural networks) that excludes documents about the alternatives, narrowing rather than broadening retrieval.[4]
Latency and cost are also issues. Each retrieval call now requires at least one LLM generation, which is far slower and more expensive than embedding a short query directly; sampling n = 8 hypothetical documents, as in the paper, multiplies the cost further.[1] In a production setting where a normal query embedding might take a few tens of milliseconds, a single HyDE call can take a second or more even with a fast hosted model, plus the per-token cost of generation.[1] These overheads have led some production systems to apply HyDE only on queries that the system judges retrieval-difficult, to use cheaper generators (such as a small open-weights model) for the generation step, or to cache hypothetical documents for common queries.[11]
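Caching hypothetical documents for repeated queries is the simplest of these mitigations to illustrate. The wrapper below is a minimal in-memory sketch with hypothetical names (`CachedGenerator`, `llm_calls`); the query normalisation (lowercasing and collapsing whitespace) is an assumption, and a real deployment would also need eviction and persistence.

```python
from typing import Callable


class CachedGenerator:
    """Memoises hypothetical documents so repeated queries skip the LLM call."""

    def __init__(self, generate: Callable[[str, int], list[str]]) -> None:
        self._generate = generate
        self._cache: dict[tuple[str, int], tuple[str, ...]] = {}
        self.llm_calls = 0  # counts how often the underlying generator ran

    def __call__(self, query: str, n: int = 8) -> list[str]:
        # Assumed normalisation: case-insensitive, whitespace-insensitive keys.
        key = (" ".join(query.lower().split()), n)
        if key not in self._cache:
            self.llm_calls += 1
            self._cache[key] = tuple(self._generate(query, n))
        return list(self._cache[key])
```

Note that caching fixes the sampled passages for a query, so the diversity introduced by a non-zero temperature is only realised on the first call.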
A more fundamental critique appears in Yoon, Jung, Yoon, and Park's 2025 paper Hypothetical Documents or Knowledge Leakage? Rethinking LLM-based Query Expansion. Across three benchmarks and seven LLMs, the authors show that HyDE and Query2doc gains correlate strongly with whether the LLM's generated text contains sentences entailed by the gold evidence: over 40 percent of generated documents across their setup matched gold evidence, peaking at 83.5 percent on FEVER with GPT-4o-mini, and the techniques fell below baseline on queries whose answers were not in the model's training distribution.[13] They argue that some reported gains may reflect knowledge leakage from pre-training rather than improved query-document alignment, and that HyDE's value for genuinely novel or post-cutoff information is more limited than its benchmark results suggest.[13] This implies that HyDE may be most valuable when the generator's parametric knowledge overlaps with the corpus and least valuable on the very out-of-distribution problems that motivate zero-shot retrieval in the first place.[13]
A practical engineering downside is that HyDE introduces a runtime coupling between the LLM and the dense encoder. Changing the generator (for example moving from InstructGPT to a newer model with a different writing style) changes the distribution of hypothetical documents and can shift which corpus passages end up in the top-k. Teams that adopt HyDE typically need to re-evaluate the pipeline whenever the generator is upgraded, in addition to the usual re-evaluation when the embedding model is upgraded.[11]
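A lightweight way to spot this kind of drift is to compare the top-k results for a fixed set of queries before and after the generator change. The helper below computes a Jaccard overlap of two top-k lists; it is an illustrative check with a hypothetical name, not a substitute for a relevance-judged evaluation.

```python
def topk_jaccard(ranking_a: list[str], ranking_b: list[str], k: int = 10) -> float:
    """Jaccard overlap between the top-k document ids of two rankings."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    if not top_a and not top_b:
        return 1.0  # two empty result lists are treated as identical
    return len(top_a & top_b) / len(top_a | top_b)


old_generator_hits = ["p1", "p2", "p3", "p4"]
new_generator_hits = ["p2", "p1", "p5", "p6"]
print(topk_jaccard(old_generator_hits, new_generator_hits, k=3))  # 0.5
```

A low overlap does not by itself mean quality dropped, only that the retrieved set changed enough to justify a closer look.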
Liang Wang, Nan Yang, and Furu Wei at Microsoft Research released Query2doc: Query Expansion with Large Language Models on arXiv in March 2023 and published it at EMNLP 2023.[8] Query2doc differs from HyDE on three axes. First, it uses few-shot rather than zero-shot prompting; the prompt includes several real query-document pairs as in-context examples before asking the LLM to write a pseudo-document for the new query.[8] Second, it concatenates the generated pseudo-document to the original query string rather than averaging embeddings; the concatenated string is then fed to a normal retriever, which makes the method usable for sparse BM25 as well as dense indexes.[8] Third, the reported gains are layered on supervised retrievers in addition to BM25, with the authors reporting a 3 to 15 percent improvement over BM25 alone on MS MARCO and TREC DL.[8]
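The concatenation step is what makes Query2doc usable with a lexical retriever: the output is a longer search string rather than a vector. The sketch below shows the idea with a crude term-overlap count standing in for BM25; both function names are hypothetical, the plain space separator is an assumption, and any weighting of query against pseudo-document is not modelled.

```python
def query2doc_expand(query: str, pseudo_doc: str, separator: str = " ") -> str:
    """Concatenate the original query with an LLM-written pseudo-document."""
    return f"{query}{separator}{pseudo_doc}"


def term_overlap(search_string: str, passage: str) -> int:
    """Toy lexical score (NOT BM25): number of distinct shared tokens."""
    return len(set(search_string.lower().split()) & set(passage.lower().split()))


query = "hyde retrieval"
pseudo_doc = "hypothetical document embeddings improve zero-shot dense search"
passage = "zero-shot dense retrieval with hypothetical document embeddings"

print(term_overlap(query, passage))                                # 1
print(term_overlap(query2doc_expand(query, pseudo_doc), passage))  # 6
```

HyDE, by contrast, discards the generated text after embedding it, so only the averaged vector reaches the index.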
Yoon, Jung, Yoon, and Park's 2025 critique paper, discussed above, is the most cited follow-up evaluation of HyDE-style methods. It frames HyDE and Query2doc not as orthogonal retrieval techniques but as proxies for memorisation, and recommends that future evaluations measure overlap with pre-training data when claiming gains from LLM-based query expansion.[13]
Subsequent work has explored adaptive HyDE pipelines that invoke the generator only on queries the system flags as retrieval-difficult, with the goal of recovering the quality wins on hard queries while avoiding the cost on easy ones.[11] LlamaIndex has separately added query decomposition transforms that, like HyDE, manipulate the query before retrieval; the two are often combined.[12] Application-domain papers have applied HyDE-style retrieval to developer support QA, biomedical literature search, and tutoring systems, in each case noting that the generated passage's quality dominates retrieval quality.[11]
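An adaptive pipeline of this kind needs some rule for deciding when a query is retrieval-difficult. The sketch below uses the top similarity score of a plain first-pass search as that signal, which is an assumption made for illustration; the cited work may use other criteria. `adaptive_retrieve`, `plain_search`, `hyde_search`, and `min_top_score` are hypothetical names.

```python
from typing import Callable

Hits = list[tuple[str, float]]  # (passage id, similarity score), best first


def adaptive_retrieve(
    query: str,
    plain_search: Callable[[str], Hits],
    hyde_search: Callable[[str], Hits],
    min_top_score: float = 0.5,
) -> tuple[Hits, bool]:
    """Return (hits, used_hyde); fall back to HyDE only for low-confidence queries."""
    hits = plain_search(query)
    if hits and hits[0][1] >= min_top_score:
        return hits, False  # easy query: skip the LLM call entirely
    return hyde_search(query), True  # hard query: pay for generation
```

The threshold would need tuning per encoder and corpus, since raw similarity scores are not comparable across embedding models.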
| Method | Year | Mechanism | Supervision | Typical pairing |
|---|---|---|---|---|
| BM25 | 1994 (Robertson) | Lexical scoring | None | Sparse index |
| DPR | 2020 | Dual encoder, in-domain training | Supervised | FAISS |
| Contriever | 2022 | Unsupervised contrastive dual encoder | Self-supervised | FAISS |
| HyDE | 2022 | LLM hypothetical doc + unsupervised dense encoder | Zero-shot | Contriever + FAISS |
| Query2doc | 2023 | LLM pseudo-doc + concatenation with query | Few-shot | BM25 or dense retriever |
| ColBERT late interaction | 2020 / v2 2021 | Token-level MaxSim | Supervised | ColBERT index |
| SPLADE | 2021 | Learned sparse expansion | Supervised | Inverted index |
HyDE is best understood as orthogonal to, rather than competing with, most of the entries in this table: it modifies only the query path and can be layered on top of any dense retriever or even on BM25 (by using the generated pseudo-document as the search string), and it composes with reranking, hybrid sparse-plus-dense pipelines, and per-domain fine-tuning.[1][11]