GraphRAG is a graph-based approach to retrieval-augmented generation developed by Microsoft Research, first described publicly in February 2024 and formalized in the paper "From Local to Global: A Graph RAG Approach to Query-Focused Summarization" (arXiv:2404.16130), published in April 2024. Rather than retrieving text chunks by semantic similarity alone, GraphRAG builds a knowledge graph from source documents, detects communities of related entities using the Leiden algorithm, generates hierarchical natural-language summaries of those communities, and uses that structured representation at query time. The result is a system that can answer both narrow factual questions and broad, corpus-wide sensemaking questions that defeat conventional vector search.
Microsoft released the system as open-source software under the MIT License at github.com/microsoft/graphrag. In November 2024, the same team published LazyGraphRAG, a cheaper variant that defers almost all LLM use to query time, cutting indexing costs to roughly 0.1% of the original pipeline's cost.
Conventional retrieval-augmented generation works by splitting documents into fixed-size text chunks, embedding each chunk with a vector embedding model, and storing the resulting vectors in an index. When a user poses a query, the system converts that query into a vector and retrieves the chunks whose embeddings are closest by cosine similarity. The retrieved chunks are then injected into an LLM prompt to generate a final answer.
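As a point of reference, the whole conventional pipeline fits in a few lines. The sketch below assumes an OpenAI-compatible embedding endpoint and an in-memory index; the model name, chunk size, and helper names are illustrative rather than drawn from any particular RAG framework.

```python
# Minimal sketch of plain vector RAG retrieval (illustrative, not the GraphRAG code).
# Assumes an OpenAI-compatible client; model name and chunk size are placeholders.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chunk(text: str, size: int = 1200) -> list[str]:
    """Split a document into fixed-size character chunks (simplified; real systems count tokens)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def retrieve(query: str, chunks: list[str], chunk_vecs: np.ndarray, k: int = 5) -> list[str]:
    """Return the k chunks whose embeddings are closest to the query by cosine similarity."""
    q = embed([query])[0]
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(-sims)[:k]]
```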
This architecture works well for queries where the answer is localized: a user asking "What year was the Eiffel Tower built?" will retrieve whichever chunk contains the answering sentence, because the vector similarity between the question and the answer-bearing passage is high enough to reliably surface it.
The architecture struggles with what Microsoft researchers call global sensemaking questions: queries directed at the corpus as a whole rather than at specific passages within it. Questions such as "What are the dominant themes in this dataset of financial reports?" or "How do the different authors in this archive disagree on climate policy?" require synthesizing information scattered across thousands of documents. No single chunk contains the answer; the answer is a property of the corpus in aggregate. Retrieving the top-k closest chunks does not solve this problem because relevance by chunk similarity does not correspond to relevance for a corpus-level question.
A second class of failures involves relationship queries: finding that Entity A and Entity B are related through a chain of intermediate entities that are never mentioned together in any single passage. Vector search retrieves individual passages, not relationship chains, so multi-hop reasoning is difficult to perform reliably.
GraphRAG was designed specifically to address the global sensemaking failure mode while preserving the ability to answer local, entity-specific questions.
The paper "From Local to Global: A Graph RAG Approach to Query-Focused Summarization" was submitted to arXiv on April 24, 2024 (revised February 19, 2025) by Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson, all affiliated with Microsoft Research or Microsoft Strategic Missions and Technologies.
The paper frames the problem as query-focused summarization over private text corpora that can run into the millions of tokens. Standard RAG, the paper argues, treats all questions as local retrieval problems. The contribution of GraphRAG is to provide a principled path for global questions by building a structured, hierarchical representation of the corpus before any query arrives.
The evaluation used two datasets: podcast transcripts from "Behind the Tech with Kevin Scott" (approximately 1 million tokens, yielding a graph of 8,564 nodes and 20,691 edges after indexing) and a set of news articles spanning September 2013 through December 2023 (approximately 1.7 million tokens, 15,754 nodes, 19,520 edges). The team tested multiple conditions, labeling them C0 through C3 to represent progressively finer levels of the community hierarchy, alongside a source-text map-reduce baseline (TS) and a vector RAG semantic search baseline (SS).
All GraphRAG conditions substantially outperformed vector RAG on comprehensiveness (72 to 83% win rates, p < .001 across datasets) and on diversity of perspectives (62 to 82% win rates). Vector RAG scored higher on directness, which the authors treated as a validation control: a system that retrieves a close passage should indeed give a more directly targeted answer to a local question, while GraphRAG's broader synthesis produces more comprehensive but less tightly focused responses. Root-level community summaries (C0) required 97% fewer tokens per query than source-text summarization, while the most detailed level (C3) still required 26 to 33% fewer tokens than the naive summarization baseline.
The paper also introduced a second evaluation approach based on counting factual claims extracted from each answer. LLM judgment of comprehensiveness aligned with claim-count comprehensiveness at a 78% rate, giving the evaluation some validation beyond pure LLM preference.
GraphRAG's architecture has two distinct phases: an indexing pipeline that runs once per corpus and a query engine that runs at request time. The indexing pipeline is computationally expensive and involves multiple LLM calls over every text unit in the corpus. The query engine consults the pre-built structures and uses a smaller number of LLM calls to synthesize an answer.
The first substantive step of the indexing pipeline is splitting source documents into text units. The default chunk size in the Microsoft implementation is 1,200 tokens (the paper used 600-token chunks with 100-token overlaps). Each text unit is then processed by an LLM to extract a structured list of entities (people, organizations, locations, concepts, events) along with the relationships between them and concise descriptions of each. A self-reflection prompt is used after the initial extraction to ask the model whether any entities or relationships were missed, improving recall at the cost of additional LLM calls.
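A schematic of this extraction-plus-reflection loop is sketched below. The prompt wording and the `call_llm` helper are placeholders for illustration, not the prompts shipped with the package, which are considerably more detailed.

```python
# Schematic of entity/relationship extraction with one self-reflection pass.
# The prompt texts and the `call_llm` helper are placeholders, not the shipped prompts.
import json

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call returning the model's text."""
    raise NotImplementedError

def extract_from_text_unit(text_unit: str) -> dict:
    first_pass = call_llm(
        "Extract all entities (name, type, description) and relationships "
        "(source, target, description) from the text below as JSON.\n\n" + text_unit
    )
    result = json.loads(first_pass)

    # Self-reflection pass: ask whether anything was missed and merge the additions.
    gleaning = call_llm(
        "Here is the text and a first extraction attempt. List any entities or "
        "relationships that were missed, as JSON with the same schema.\n\n"
        f"TEXT:\n{text_unit}\n\nFIRST PASS:\n{first_pass}"
    )
    extra = json.loads(gleaning)
    result["entities"] += extra.get("entities", [])
    result["relationships"] += extra.get("relationships", [])
    return result
```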
Because the same real-world entity may appear under different names across different text units ("Microsoft," "Microsoft Corporation," "MSFT"), a deduplication and merging step consolidates entity descriptions. Multiple descriptions extracted from different text units are summarized by the LLM into a single canonical description per entity and per relationship. The result is a unified knowledge graph covering the entire corpus.
The paper notes that this extraction is domain-sensitive. The system ships with auto-tuning tooling that generates corpus-specific prompts to improve extraction quality for specialized text (legal documents, scientific papers, clinical notes, and so on).
The extracted entities and relationships form the nodes and edges of a property graph. Each node carries its canonical description. Each edge carries a description of the relationship, the strength of that relationship (derived from how many text units co-mention the two entities), and the source text units that support it. Optional claim extraction (disabled by default) can attach covariates to entities, such as time-bounded assertions about the entity's state.
The graph is stored as a set of tables (entities, relationships, covariates, communities, community reports, and text units) rather than as a native graph database, making the system compatible with standard data infrastructure. Third-party integrations with Neo4j, LangChain, and LlamaIndex allow teams to store and query these structures using graph databases if preferred.
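As an illustration of how downstream systems consume these tables, the Parquet outputs can be loaded with pandas and rebuilt into an in-memory graph. The file and column names below are indicative only; they differ across graphrag versions.

```python
# Load GraphRAG's tabular outputs and rebuild an in-memory graph with networkx.
# File and column names are indicative only; they vary across graphrag versions.
import pandas as pd
import networkx as nx

entities = pd.read_parquet("output/entities.parquet")
relationships = pd.read_parquet("output/relationships.parquet")

g = nx.Graph()
for _, e in entities.iterrows():
    g.add_node(e["title"], description=e["description"])
for _, r in relationships.iterrows():
    g.add_edge(r["source"], r["target"],
               description=r["description"], weight=r["weight"])

print(g.number_of_nodes(), "entities,", g.number_of_edges(), "relationships")
```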
Once the knowledge graph is built, GraphRAG applies the Leiden algorithm to partition the graph into communities of closely related entities. Leiden is an improvement over the older Louvain algorithm that addresses a specific failure mode: Louvain can produce internally disconnected communities, where nodes in the same community are not reachable from each other. Leiden guarantees that every identified community is internally connected, producing a more coherent grouping.
Leiden also lends itself naturally to hierarchical use. Applying the algorithm recursively produces a community tree: level 0 (the root level) contains the broadest communities, level 1 splits each of those into finer sub-communities, and deeper levels continue until communities are small groups of tightly related entities. This hierarchy maps directly onto the C0-C3 conditions evaluated in the paper: C0 uses the root-level communities, C1 is one level down, and C2 and C3 are progressively finer.
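The hierarchical partitioning itself can be reproduced with the graspologic library that the official pipeline uses for community detection. The sketch below assumes graspologic's documented `hierarchical_leiden` interface and record fields; the stand-in graph and the `max_cluster_size` value are illustrative.

```python
# Sketch of hierarchical community detection with the graspologic library used by the
# official pipeline. Node labels are converted to strings; max_cluster_size is illustrative.
import networkx as nx
from graspologic.partition import hierarchical_leiden

g = nx.relabel_nodes(nx.karate_club_graph(), str)  # stand-in for the extracted entity graph

partitions = hierarchical_leiden(g, max_cluster_size=10)
for p in partitions:
    # Each record maps a node to a community at one level of the hierarchy; level 0 is the
    # root (coarsest) partition and deeper levels are finer splits of large communities.
    print(p.level, p.node, p.cluster, p.is_final_cluster)
```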
The hierarchical structure is important because it controls the tradeoff between comprehensiveness and token cost at query time. Querying C0 (root-level summaries) consumes very few tokens per query but may lose fine-grained detail. Querying C3 (lowest-level summaries) produces more nuanced answers but requires processing many more summaries.
For each community at every level of the hierarchy, the pipeline makes one or more LLM calls to generate a natural-language community report. These reports describe the key entities in the community, the principal relationships among them, and the major themes or findings that emerge from combining the extracted information. An executive summary is written for the community as a whole, followed by more detailed subsections covering individual entities and relationships.
The community reports are the core retrieval unit for global search. Because they are precomputed, they are available instantly at query time without any additional LLM calls for summarization. The tradeoff is that they encode a snapshot of the corpus at indexing time; if the corpus changes, the reports must be regenerated for affected communities.
After community summarization, the pipeline generates vector embeddings for text units, entity descriptions, and community reports. These embeddings support semantic search during local queries, where the system needs to find entities or passages that are semantically similar to the user's question.
GraphRAG ships with three primary query modes, each suited to a different class of question.
Global search addresses corpus-wide sensemaking questions. When a user submits a query, the system retrieves community reports from an appropriate level of the hierarchy and generates partial answers from each relevant report in parallel. These partial answers are then aggregated through a second LLM call (a reduce step) into a final, synthesized answer. Partial answers that are scored as unhelpful by the LLM are discarded before reduction. Because every community at the queried level is consulted, global search can draw on information from anywhere in the corpus, at the cost of processing many summaries.
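A schematic of this map-reduce flow is sketched below. The `call_llm` helper, the prompt wording, and the helpfulness-scoring convention are placeholders for illustration; they are not the shipped global-search prompts.

```python
# Schematic of global search: map over community reports in parallel, discard
# unhelpful partial answers, then reduce the rest into one final answer.
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder chat-completion call

def map_step(query: str, report: str) -> tuple[int, str]:
    answer = call_llm(
        f"Using only this community report, answer: {query}\n\n{report}\n\n"
        "On the final line, output only a helpfulness score from 0 to 100."
    )
    text, _, score = answer.strip().rpartition("\n")
    return int(score) if score.isdigit() else 0, text

def global_search(query: str, community_reports: list[str]) -> str:
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda r: map_step(query, r), community_reports))
    kept = [text for score, text in partials if score > 0]  # drop unhelpful partial answers
    return call_llm(
        f"Combine these partial answers into one final answer to '{query}':\n\n"
        + "\n---\n".join(kept)
    )
```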
Local search addresses entity-specific questions. It begins with a vector similarity search over entity descriptions to identify entities most relevant to the query, then retrieves those entities' direct relationships, neighboring entities, associated community reports, and the source text units that support them. All of this context is combined into an LLM prompt that can run up to 8,000 tokens (configurable). Local search is faster and cheaper than global search but cannot synthesize information from parts of the graph that are not directly connected to the seed entities.
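The corresponding local-search flow can be sketched as a seed-and-expand procedure over the graph. The data shapes, helper functions, and context formatting below are illustrative assumptions, not the package's internal API.

```python
# Schematic of local search: find seed entities by embedding similarity, then pull in
# their neighbours, relationships, and associated community reports as LLM context.
import numpy as np
import networkx as nx

def local_search(query: str, g: nx.Graph, entity_vecs: dict[str, np.ndarray],
                 reports_by_entity: dict[str, str], embed, call_llm, k: int = 5) -> str:
    qv = embed(query)
    # Seed entities: the k entity descriptions most similar to the query.
    seeds = sorted(entity_vecs, key=lambda e: -float(entity_vecs[e] @ qv))[:k]

    context = []
    for e in seeds:
        context.append(f"ENTITY {e}: {g.nodes[e].get('description', '')}")
        for nbr in g.neighbors(e):
            context.append(f"RELATION {e} -- {nbr}: {g.edges[e, nbr].get('description', '')}")
        if e in reports_by_entity:
            context.append(f"REPORT: {reports_by_entity[e]}")

    return call_llm(f"Answer '{query}' using this context:\n" + "\n".join(context))
```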
DRIFT search (Dynamic Reasoning and Inference with Flexible Traversal), introduced in late 2024, combines elements of both modes. It first performs a semantic search over community reports to establish an initial broad answer, then uses that answer to generate follow-up questions, which are executed as local searches in parallel. The parallel local results are re-ranked and merged to form the final answer. DRIFT achieves better breadth-depth balance than either pure global or pure local search and does so at substantially lower cost than full global search, because it does not process every community report.
The indexing pipeline is the primary cost driver for GraphRAG. Every text unit in the corpus must be processed by an LLM for entity extraction (with self-reflection), entity merging requires additional LLM summarization calls, and community report generation adds one more round of LLM calls per community across the full hierarchy.
The paper reported that indexing the 1-million-token podcast dataset using GPT-4-turbo took 281 minutes on a 16 GB RAM machine. Community detection uses the graspologic library and runs locally without LLM calls, so its cost is comparatively negligible.
Real-world cost reports from early users varied widely with model choice and corpus size. Using GPT-4-turbo or GPT-4o, indexing a corpus in the hundreds of thousands of words cost tens to hundreds of dollars in API fees. The Microsoft team released a cost estimation tool to allow users to project expenses before committing to a full index run. Using cheaper models (GPT-4o-mini, open-source models via local inference) substantially reduces cost but may degrade extraction quality.
LangChain, LlamaIndex, and the official Python package all allow using any OpenAI-compatible API, which means teams can route calls through locally hosted models to avoid per-token charges entirely.
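For example, the OpenAI Python client (and tools built on it) accepts a custom base URL, so calls can be pointed at a local server that exposes the same API. The URL and model name below are placeholders.

```python
# Routing LLM calls through a locally hosted, OpenAI-compatible server.
# The base URL and model name are placeholders for a local deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")
resp = client.chat.completions.create(
    model="local-model-name",
    messages=[{"role": "user",
               "content": "Extract entities from: Microsoft acquired GitHub."}],
)
print(resp.choices[0].message.content)
```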
LazyGraphRAG was announced on November 25, 2024, in a Microsoft Research blog post by Darren Edge, Ha Trinh, and Jonathan Larson, the same core team that published the original GraphRAG paper.
The central observation motivating LazyGraphRAG is that the expensive parts of standard GraphRAG (LLM-based entity extraction and LLM-based community summarization) happen entirely at indexing time. If a team only needs to run a few queries over a corpus, paying the full indexing cost is not justified. LazyGraphRAG removes those steps from the indexing phase and defers all LLM use to query time.
During indexing, LazyGraphRAG uses NLP noun phrase extraction (not an LLM) to identify concepts in text units and builds a concept co-occurrence graph from those noun phrases. Community structure is derived from the co-occurrence graph using the same Leiden algorithm, but without LLM-generated community reports. The result is that LazyGraphRAG's indexing cost is identical to that of plain vector RAG and approximately 0.1% of the cost of full GraphRAG.
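A minimal sketch of this LLM-free indexing step, using spaCy noun chunks and a networkx co-occurrence graph, is shown below. The spaCy model and the simple pairwise weighting are illustrative choices, not the exact Microsoft implementation.

```python
# Sketch of LazyGraphRAG-style indexing: extract noun phrases per text unit with spaCy
# (no LLM) and accumulate a concept co-occurrence graph.
import itertools
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")
g = nx.Graph()

def index_text_unit(text: str) -> None:
    concepts = {chunk.text.lower() for chunk in nlp(text).noun_chunks}
    for a, b in itertools.combinations(sorted(concepts), 2):
        # Increment the co-occurrence weight for every pair of concepts in the same unit.
        w = g.get_edge_data(a, b, default={"weight": 0})["weight"]
        g.add_edge(a, b, weight=w + 1)

index_text_unit("GraphRAG builds a knowledge graph from source documents.")
index_text_unit("The knowledge graph is partitioned into communities of entities.")
```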
At query time, LazyGraphRAG generates 3 to 5 subqueries from the user's question using an LLM. For each subquery, it ranks text chunks by vector similarity, then ranks communities by aggregating the ranks of their top-k chunks. It then uses an LLM to assess sentence-level relevance within those communities, recursing into sub-communities when a given community does not contain relevant content, and terminating when the relevance test budget is exhausted or no relevant communities remain. This "iterative deepening" combines best-first and breadth-first traversal, concentrating LLM calls on the portions of the graph most likely to contain useful information.
The evaluation on 5,590 AP news articles with 100 synthetic queries (50 local, 50 global) showed that at a relevance test budget of 500 (the Z500 configuration, which costs about 4% of what a comparable GraphRAG Global Search call costs), LazyGraphRAG significantly outperformed all eight competing conditions tested, including plain vector RAG, RAPTOR, full GraphRAG global search, and DRIFT search, on both local and global query types. At a budget of 100 (Z100_Lite), it still matched or exceeded most baselines at a fraction of their cost.
The main tradeoff is that LazyGraphRAG requires more LLM calls per query than standard GraphRAG (whose community reports are already precomputed), so it is slower per query and its per-query cost rises with budget. For use cases with high query volume and a stable corpus, the full indexing cost of standard GraphRAG is amortized over many queries, making standard GraphRAG the more economical choice at scale.
As of mid-2025, LazyGraphRAG had been integrated into Microsoft Discovery and Azure Local services.
The table below summarizes key differences between plain vector RAG, GraphRAG, and hybrid approaches that combine the two.
| Dimension | Vector RAG | GraphRAG | Hybrid RAG |
|---|---|---|---|
| Index structure | Dense vector index | Knowledge graph + community hierarchy + vectors | Vector index + knowledge graph |
| Indexing cost | Low | High (many LLM calls) | Medium |
| Local (entity) queries | Strong | Strong (local search mode) | Strong |
| Global (corpus-wide) queries | Weak | Strong (global search mode) | Variable |
| Multi-hop reasoning | Weak | Strong | Strong |
| Relationship queries | Weak | Strong | Strong |
| Latency per query | Low | Medium to high | Medium |
| Corpus update handling | Easy (re-embed changed chunks) | Expensive (re-extract and re-summarize) | Partial |
| Hallucination risk | Moderate | Moderate (LLM extraction introduces noise) | Moderate |
| Transparency | Low (black-box similarity) | Medium (graph is inspectable) | Medium |
Vector RAG's primary advantage is simplicity and cost. It requires no domain-specific configuration, indexes quickly, and handles corpus updates by re-embedding only the changed chunks. It consistently outperforms GraphRAG on directness for local factual questions because the nearest-neighbor retrieval is well-matched to that query type.
GraphRAG's advantage grows with query complexity and corpus size. For questions that require synthesizing information across many documents or understanding global patterns in data, standard vector RAG accuracy drops substantially. Some published benchmarks report GraphRAG improving accuracy from around 50% to 80% on broad analytical queries over large corpora, and that for queries involving more than five entities, vector RAG accuracy degrades sharply while GraphRAG maintains stable performance.
Hybrid approaches combine a vector index with a knowledge graph, using techniques like Reciprocal Rank Fusion (RRF) to merge the ranked results from both retrieval systems. This gives strong performance on both local and relational queries at the cost of maintaining two separate indices and pipelines. LangChain, LlamaIndex, and Neo4j all provide hybrid RAG implementations that incorporate GraphRAG-style knowledge graphs.
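RRF itself is simple: each retriever contributes 1/(k + rank) for every item it returns, and items ranked highly by both systems accumulate the largest fused scores. A minimal sketch (with the customary k = 60) follows; the chunk identifiers are made up for illustration.

```python
# Reciprocal Rank Fusion (RRF) over two ranked result lists, one from the vector index
# and one from the graph retriever. k=60 is the commonly used smoothing constant.
from collections import defaultdict

def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["chunk_12", "chunk_7", "chunk_3"]
graph_hits = ["chunk_7", "chunk_31", "chunk_12"]
print(rrf([vector_hits, graph_hits]))  # items ranked highly by both rise to the top
```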
The reference implementation is the official Microsoft Python package at github.com/microsoft/graphrag, released under the MIT License. It provides CLI tools for indexing (graphrag index), query (graphrag query), and prompt auto-tuning (graphrag prompt-tune), and stores outputs as Parquet files that can be consumed by downstream systems. The package is available via pip (pip install graphrag).
Neo4j has published detailed integration guides showing how to load GraphRAG's Parquet outputs into a Neo4j graph database and expose them through LangChain or LlamaIndex retrievers. Because Neo4j stores the entities and relationships natively, it can apply Cypher queries to the knowledge graph in addition to the standard GraphRAG search modes, enabling more precise relational traversals.
LangChain supports GraphRAG through its graph-aware document loaders and retriever abstractions. Teams can connect LangChain chains to a GraphRAG-built Neo4j instance, a FalkorDB instance, or any other graph store, and compose those retrievers with other LangChain components such as memory modules or agent executors.
LlamaIndex provides a GraphRAG cookbook and PropertyGraph abstractions that integrate with Neo4j. The LlamaIndex implementation allows developers to use LlamaIndex's pipeline configuration tools to set up the extraction, community detection, and summarization steps through YAML configuration rather than code.
FalkorDB, NebulaGraph, and Memgraph have each published integration guides and benchmarks comparing their graph stores as a backend for GraphRAG. Microsoft's Azure AI Foundry service includes GraphRAG as a managed option, allowing enterprise users to run the indexing pipeline and query engine without managing the underlying infrastructure.
Beyond the official implementation, the broader research community has developed variants. RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval, Stanford 2024) independently converged on a similar idea of recursive summarization, though it builds trees rather than graphs. HippoRAG and PathRAG are other graph-based retrieval variants that optimize different aspects of the traversal. OG-RAG uses ontology grounding during extraction to reduce hallucinations by constraining entity types to a predefined schema.
GraphRAG's design makes it particularly well-suited to a specific set of information retrieval problems.
Question answering over large private corpora is the primary use case described in the original paper. Internal corporate knowledge bases, archives of customer support tickets, collections of legal documents, and repositories of scientific literature all share the property that useful information is distributed across many documents and that users need answers requiring synthesis across that distribution. A legal team querying a contract archive for clauses with a particular risk profile, or a researcher asking what an archive of clinical trial reports says about a particular mechanism, benefit from the global summarization capabilities that GraphRAG provides.
Thematic analysis and corpus summarization are tasks where GraphRAG's community hierarchy is directly useful even without a specific question. Running global search at the root community level effectively generates an executive summary of the entire corpus organized around the principal entity clusters. News archives, corporate strategy documents, and academic literature collections can be mapped in this way without manual curation.
Competitive intelligence and investigative research involve building a picture of how organizations, people, products, and events relate to each other across many documents. Because the knowledge graph explicitly encodes these relationships, multi-hop queries ("What organizations does Person A fund that also partner with Organization B?") are tractable in a way they are not with pure vector search.
Streaming and exploratory data analysis are use cases better served by LazyGraphRAG than by standard GraphRAG. When a corpus is changing frequently or when only a handful of queries will be run, the up-front indexing cost of standard GraphRAG is not justified. LazyGraphRAG's near-zero indexing cost and flexible quality-cost tradeoff via the relevance budget make it well suited to one-off analyses.
For teams evaluating GraphRAG for production deployment, cost deserves careful analysis at both the indexing and query stages.
At indexing time, the dominant cost driver is the number of LLM calls multiplied by the cost per token of the chosen model. A rough rule of thumb from community benchmarks is that indexing 1 million tokens of source text with GPT-4o costs on the order of $10 to $30 depending on extraction complexity and the number of self-reflection passes. A very large corpus of 100 million tokens could cost $1,000 to $3,000 to index with a mid-tier model. Using a local or open-source model reduces this to infrastructure cost only.
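A back-of-the-envelope estimate of the extraction cost can be made directly from these figures, as in the sketch below. The per-call overhead, number of passes, and price are placeholders to replace with measured prompt sizes and current model pricing, and community summarization adds further calls on top of this.

```python
# Back-of-the-envelope indexing cost estimate (extraction calls only).
# All prices and overhead factors are placeholders; substitute measured values.
def estimate_indexing_cost(corpus_tokens: int,
                           chunk_size: int = 1200,
                           passes_per_chunk: int = 2,      # extraction + one self-reflection
                           overhead_per_call: int = 1500,  # prompt template + output tokens
                           price_per_1k_tokens: float = 0.005) -> float:
    n_chunks = corpus_tokens // chunk_size
    tokens_per_call = chunk_size + overhead_per_call
    total_tokens = n_chunks * passes_per_chunk * tokens_per_call
    return total_tokens / 1000 * price_per_1k_tokens

print(f"${estimate_indexing_cost(1_000_000):.2f} for a 1M-token corpus (illustrative)")
```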
At query time, global search is more expensive than local search because it processes many community summaries. A single global search query can consume thousands of tokens in processing community reports, while a local search query is bounded by the context window of a single LLM call. Teams with high query volume and a stable corpus generally find that the up-front indexing investment is justified once enough queries have been run to amortize it, since the precomputed community reports keep per-query synthesis costs bounded.
LazyGraphRAG's 0.1% indexing cost changes the economics substantially for low-query-volume use cases. The tradeoff is that per-query cost at the Z500 budget level is still higher than a plain vector RAG query (because multiple LLM relevance tests are run), so LazyGraphRAG's total cost over many queries will eventually exceed that of standard GraphRAG with precomputed summaries.
Several optimization strategies have been documented by practitioners. Reducing the chunk overlap (or setting it to zero) decreases the number of extraction calls. Using a cheaper model for entity extraction and a stronger model only for community summarization and query answering captures most of the quality benefit at reduced cost. Selective re-indexing (re-running extraction only on newly added documents rather than the full corpus) reduces update costs for growing corpora.
GraphRAG has several practical limitations that constrain its applicability.
High indexing cost is the most frequently cited barrier to adoption. For teams exploring a new dataset or running exploratory analyses, paying the full indexing cost before seeing any results is a significant commitment. LazyGraphRAG mitigates this but does not fully eliminate it for all deployment scenarios.
Graph extraction quality is sensitive to the quality of the underlying LLM and to the domain of the text. General-purpose extraction prompts may miss specialized entities or relationships in technical domains. The auto-tuning system helps, but tuning requires running the pipeline on a sample of the target corpus, which adds a setup step before production use.
Corpus staleness is a problem for any system that relies on a precomputed index. When source documents change, the extracted graph, community structure, and community reports must all be updated. Unlike vector RAG, where updating requires only re-embedding the changed chunks, GraphRAG updates may cascade: a change in one document affects extraction of entities and relationships, which may alter community membership, which may require re-generating community reports for affected communities. Efficient incremental updating remains an open research problem.
Hallucination risk exists at each LLM-mediated step: entity extraction, relationship extraction, entity description summarization, and community report generation all involve LLM calls that can introduce errors or fabrications. Because the knowledge graph is built from LLM-extracted content rather than from exact text, errors made during indexing are encoded into the permanent structure of the graph and propagate into query answers. Post-extraction verification tooling is available but not yet standard practice.
Latency is higher than vector RAG for complex queries. Global search queries process many community summaries in parallel, which requires multiple LLM calls even with batching. On a large corpus, a global search query can take tens of seconds.
The system is primarily designed for text corpora. Structured data (tables, databases), code repositories, or multi-modal content (images, audio) require additional preprocessing or specialized extraction pipelines that are not included in the standard implementation.
Microsoft's published roadmap and research blog posts point to several active development areas.
Dynamic community selection is an ongoing research thread. Rather than pre-selecting a fixed community level (C0, C1, C2, C3), a query-adaptive system would select the appropriate level dynamically based on the estimated complexity of the incoming question. A 2024 Microsoft Research post described work in this direction, reporting improvements in both answer quality and token efficiency over the fixed-level approach.
Incremental indexing, allowing the graph to be updated with new documents without full re-indexing, is a practical engineering priority cited repeatedly by the community. Several third-party implementations have proposed partial solutions, but a robust, general approach integrated into the official pipeline does not yet exist as of early 2026.
Multi-modal GraphRAG is an area of active interest. Extending entity extraction to images, audio transcripts, and structured tables would allow the knowledge graph to capture relationships across heterogeneous data types, which is a common requirement in domains like medical research, where information appears in both clinical notes and imaging reports.
Agent integration is a natural extension. LLM agents that have access to a GraphRAG knowledge graph can use the graph as a structured external memory, querying it as one tool among many. Mem0 and similar memory management systems provide related infrastructure for persistent agent memory; combining these with GraphRAG's structured representation is an active research direction.
Scalability beyond the million-token range is another open question. The original paper evaluated on corpora up to 1.7 million tokens. Corpora of hundreds of millions of tokens (enterprise document repositories, legal archives, entire scientific fields) require architectural changes to the extraction and community detection pipelines to remain tractable.