A knowledge graph is a structured representation of real-world entities and the relationships between them, organized as a network of interconnected nodes and edges. Each node represents an entity (such as a person, place, organization, product, or concept), and each edge represents a relationship between two entities (such as "born in", "founded by", or "is a type of"). Knowledge graphs store facts in a machine-readable format that lets humans and software systems query, reason over, and derive new insights from large bodies of information.
The term entered mainstream usage on May 16, 2012, when Google introduced its Google Knowledge Graph to enhance search results with structured information panels [1]. Since then, knowledge graphs have become foundational infrastructure for artificial intelligence applications ranging from search engines and recommendation systems to drug discovery and fraud detection. As of 2026, Gartner identifies knowledge graphs as a critical enabler for generative AI, with hybrid retrieval architectures combining graphs and vector indexes becoming the recommended approach for production systems [2].
The ideas underlying knowledge graphs stretch back decades. In the 1960s and 1970s, researchers in artificial intelligence developed semantic networks, graph structures that represented concepts and their relationships. M. Ross Quillian's semantic memory model (1968) and Marvin Minsky's frames (1975) attempted to encode human knowledge in formats that machines could process. These early systems suffered from a lack of standardization and limited scale, but they established the basic insight that meaning could be represented as a labeled graph of concepts.
In the 1980s, Doug Lenat's Cyc project at MCC began an ambitious attempt to encode commonsense knowledge by hand. Cyc accumulated millions of assertions about everyday concepts and influenced later commonsense projects such as ConceptNet and OpenCyc, despite criticism for its slow growth and brittle performance.
The modern lineage of knowledge graphs runs through the Semantic Web initiative. In May 2001, Tim Berners-Lee, James Hendler, and Ora Lassila published an article titled "The Semantic Web" in Scientific American [3]. The article described a vision in which web content would carry machine-readable meaning, expressed as triples of the form subject, predicate, object. The triples could be written in XML and combined across sites so that software agents could perform complex tasks like booking medical appointments or comparing products.
The World Wide Web Consortium (W3C) standardized the technical scaffolding for this vision. The Resource Description Framework (RDF) became a recommendation in February 1999 and was revised in 2004. RDF defines a data model in which every fact is a triple, with each component identified by a URI. The Web Ontology Language (OWL), released in February 2004, added richer vocabulary for defining classes, properties, cardinalities, and logical constraints. The query language SPARQL reached W3C recommendation status on January 15, 2008, giving the ecosystem a standard way to ask graph-shaped questions across distributed RDF data [4].
The Semantic Web vision in its purest form did not become the dominant model of the web, but the underlying ideas of identifiers, triples, ontologies, and graph queries became the technical core of every major knowledge graph that followed.
Two landmark projects launched in 2007 brought large-scale, publicly available knowledge graphs into existence.
DBpedia was started by Sören Auer, Christian Bizer, Jens Lehmann, and collaborators from the University of Leipzig and Freie Universität Berlin. Their paper "DBpedia: A Nucleus for a Web of Open Data" appeared at the 6th International Semantic Web Conference (ISWC) in November 2007 [5]. DBpedia automatically extracts structured information from Wikipedia infoboxes and converts it into RDF triples. By parsing the semi-structured data already present in Wikipedia, DBpedia turned the implicit structure of the encyclopedia into a queryable graph of millions of entities, and it became the central hub of the Linked Open Data cloud.
Freebase, created by the Metaweb startup, took a different approach. Launched publicly in March 2007, Freebase invited users to contribute and curate facts about real-world entities in a collaborative, open knowledge base. Metaweb's data model used typed entities and explicit schemas, and the project promoted the Metaweb Query Language (MQL) for retrieval. Google acquired Metaweb in July 2010, in part to obtain Freebase's entity data for the search engine that would later become the Google Knowledge Graph [6].
In the same year, Fabian Suchanek, Gjergji Kasneci, and Gerhard Weikum at the Max Planck Institute for Informatics published YAGO (Yet Another Great Ontology) at the 16th International World Wide Web Conference [7]. YAGO combined facts derived from Wikipedia infoboxes and category structure with the WordNet lexical database, producing a clean taxonomy of more than one million entities and five million facts at first release. The combination of crowd-extracted facts with a curated upper ontology gave YAGO unusually high precision, and the project went on to receive the 2018 Seoul Test of Time Award. Subsequent versions, including YAGO3 and YAGO 4.5, have grown to over 17 million entities and 150 million facts, organized under the schema.org taxonomy [8].
In January 2010, Tom Mitchell and his group at Carnegie Mellon University launched the Never-Ending Language Learning system, NELL [9]. NELL was designed to read the web continuously and learn new facts day after day, building up its own knowledge base from a small initial seed of categories and relations. By late 2010 NELL had learned more than 440,000 facts, and it has continued to run for over a decade. NELL became one of the touchstones for research on large-scale, weakly supervised knowledge acquisition.
In parallel, Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky published "Distant supervision for relation extraction without labeled data" at ACL-IJCNLP 2009 [10]. Their idea was to use facts already in Freebase as automatic labels for relation extraction: for any entity pair that appears in some Freebase relation, every sentence in a large corpus mentioning both entities is treated as a positive training example. Distant supervision became the dominant paradigm for training relation extractors at scale and supplied much of the data behind subsequent knowledge graph construction systems.
On May 16, 2012, Google announced the Google Knowledge Graph in a blog post by senior vice president Amit Singhal titled "Introducing the Knowledge Graph: things, not strings" [1]. The new system displayed structured information panels (Knowledge Panels) alongside traditional ranked search results. The tagline captured the underlying shift from keyword matching to entity understanding.
The initial release drew on Freebase, Wikipedia, the CIA World Factbook, and several other sources. Within seven months, the Knowledge Graph had grown to cover 570 million entities and 18 billion facts. By mid-2016, Google reported 70 billion facts in the Knowledge Graph, which was answering roughly one-third of the 100 billion monthly searches Google handled. By 2020, Google reported more than 5 billion entities. The Knowledge Graph also expanded geographically, rolling out to Spanish, French, German, Portuguese, Japanese, Russian, Italian, and other languages within months of its English debut.
Five months after Google's announcement, the Wikimedia Foundation launched Wikidata on October 29, 2012, with development led by Wikimedia Deutschland and initial funding from donors including the Allen Institute for Artificial Intelligence and Google [11]. Wikidata is a free, open, multilingual knowledge base that any person or bot can edit. Each item has a stable Q identifier (Albert Einstein is Q937), and each property has a P identifier (occupation is P106). Wikidata's content is multilingual by design and is licensed under CC0, making it usable in any context.
In December 2014, Google announced that Freebase would be shut down and its data migrated to Wikidata. The shutdown completed on May 2, 2016 [6]. The Freebase-to-Wikidata migration, while only partial because of Wikidata's higher notability bar, established Wikidata as the de facto open successor to Freebase. By 2025, Wikidata contained over 115 million data items and approximately 1.65 billion statements, making it the largest collaborative knowledge graph in the world.
For several years after 2016, knowledge graph research moved at a steady but unglamorous pace while attention went to deep learning and language models. The release of GPT-4 in March 2023 and the spread of LLM applications brought knowledge graphs back to the center of attention. Two trends drove the resurgence: LLMs proved unreliable on factual recall, motivating retrieval systems that ground generation in structured knowledge, and LLMs themselves dramatically reduced the cost of extracting entities and relations from text. Microsoft's GraphRAG paper in April 2024 became a flashpoint, and by 2025 hybrid graph-plus-vector retrieval was widely adopted in production AI systems.
Knowledge graphs represent information using a graph data model. The core building blocks are entities, relationships, properties, and an ontology that gives them types.
Entities are the fundamental objects in a knowledge graph. Each entity represents a distinct real-world thing: a person, an organization, a location, a product, an event, or an abstract concept. Every entity is assigned a unique identifier. In Wikidata, for example, Albert Einstein is identified as Q937, the concept of physicist is Q169470, and the city of Ulm is Q3012. Stable identifiers make it possible to merge information across sources without ambiguity.
Relationships connect pairs of entities and describe how they are related. A relationship is a directed, labeled edge in the graph. The fact "Albert Einstein worked at the Institute for Advanced Study" can be represented as the directed edge Q937 (Einstein) -> P108 (employer) -> Q14708 (Institute for Advanced Study). Relationships in most knowledge graphs are also typed entities in their own right, so they can have their own properties (such as a domain, range, or inverse).
The fundamental unit of information in a knowledge graph is the triple, also called a statement, consisting of three parts: subject, predicate, and object. For example, (Albert Einstein, born in, Ulm), (Albert Einstein, employer, Institute for Advanced Study), and (Ulm, country, Germany) are all triples.
Triples follow the RDF standard, and they can be stored in specialized databases called triple stores that support SPARQL queries. Some implementations extend the model to quads (subject, predicate, object, named graph) so that statements can be grouped by source or context.
Entities can also have properties: attributes that take literal values (strings, numbers, dates) rather than pointing to other entities. For example, Einstein has a birthDate of March 14, 1879, and a deathDate of April 18, 1955. The line between properties and relationships is sometimes blurry; in RDF, both are encoded as triples whose object is either an IRI (relationship) or a literal (property).
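The triple model is easy to see in code. The sketch below is a minimal example using the Python rdflib library (an assumption; any RDF toolkit would do), reusing the Wikidata-style identifiers from the examples above.

```python
# Minimal sketch of the triple model using the rdflib library (assumed available
# via `pip install rdflib`); identifiers follow the Wikidata examples in the text.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

WD = Namespace("http://www.wikidata.org/entity/")        # entities (Q identifiers)
WDT = Namespace("http://www.wikidata.org/prop/direct/")  # properties (P identifiers)

g = Graph()

einstein = WD.Q937   # Albert Einstein
ias = WD.Q14708      # Institute for Advanced Study (identifier as given in the text)

# Relationship: the object is another entity (an IRI)
g.add((einstein, WDT.P108, ias))  # employer

# Property: the object is a literal value
g.add((einstein, WDT.P569, Literal("1879-03-14", datatype=XSD.date)))  # date of birth

# Iterate over all stored facts about Einstein
for s, p, o in g.triples((einstein, None, None)):
    print(s, p, o)
```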
An ontology provides the conceptual schema for a knowledge graph: it defines the types of entities, the types of relationships allowed between them, and the constraints that govern the graph's structure. Ontologies are typically written in OWL or RDFS (RDF Schema).
For example, an ontology might specify that a Person can have an occupation pointing to a Profession but not to a City. Schema enforcement supports automated reasoning. If the ontology states that every University is an Organization and that MIT is a University, a reasoner can infer that MIT is an Organization without that fact being stored explicitly. The distinction matters: the ontology supplies the conceptual framework and consistency rules, while the graph itself holds the actual instance data about real-world entities [12].
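The inference can be illustrated in a few lines of Python; this is a toy transitive-closure sketch of the MIT / University / Organization example, not an OWL reasoner.

```python
# Illustrative sketch (not a full OWL reasoner): derive implied type facts from
# a subclass hierarchy, mirroring the MIT / University / Organization example.
subclass_of = {
    "University": {"Organization"},
    "Organization": set(),
}
instance_of = {"MIT": {"University"}}

def inferred_types(entity: str) -> set[str]:
    """Return all types of `entity`, following subClassOf transitively."""
    types, frontier = set(), list(instance_of.get(entity, ()))
    while frontier:
        t = frontier.pop()
        if t not in types:
            types.add(t)
            frontier.extend(subclass_of.get(t, ()))
    return types

print(inferred_types("MIT"))  # {'University', 'Organization'}
```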
Not every knowledge graph uses heavyweight ontologies. Wikidata, for example, has a deliberately loose schema that emerges from community use. Property graphs, the data model behind Neo4j and similar databases, omit RDF and OWL entirely in favor of nodes and edges with arbitrary key-value attributes.
Knowledge graphs can be classified by their scope and purpose.
General-purpose graphs aim to represent broad world knowledge across many domains. The Google Knowledge Graph, Wikidata, and DBpedia are the canonical examples, with millions to billions of entities spanning people, places, organizations, events, scientific concepts, and more. They power search engines, virtual assistants, and any application that needs broad factual grounding.
Domain-specific graphs focus on a particular field and capture specialized knowledge that general-purpose graphs typically lack. Biomedical examples include the Unified Medical Language System (UMLS), DrugBank, and Bio2RDF, which model relationships between diseases, genes, drugs, and biological pathways. Financial knowledge graphs map corporate ownership structures, supply chains, and regulatory filings. Legal knowledge graphs connect statutes, case law, legal concepts, and judicial opinions. Cybersecurity graphs encode threat actors, attack patterns, and indicators of compromise.
Organizations build internal knowledge graphs that integrate data from databases, CRM systems, product catalogs, internal documents, and other proprietary sources. They give a unified view of organizational knowledge and power internal search, analytics, and AI applications. LinkedIn's Economic Graph models members, jobs, companies, and skills. Airbnb's knowledge graph covers listings, hosts, neighborhoods, and amenities. Amazon's Product Graph spans products, categories, brands, and customer concepts.
The following table summarizes the most prominent knowledge graphs.
| Knowledge Graph | Creator | Launch Year | Type | Scale | Primary Use |
|---|---|---|---|---|---|
| Google Knowledge Graph | Google | 2012 | Proprietary, general-purpose | 5+ billion entities, 70+ billion facts (2016) | Search, Google Assistant, Maps |
| Wikidata | Wikimedia Foundation | 2012 | Open, general-purpose | 115+ million items, 1.65 billion statements | Wikipedia support, open data |
| DBpedia | Leipzig, Mannheim, FU Berlin | 2007 | Open, general-purpose | ~6 million entities (English) | Linked data, NLP research |
| YAGO | Max Planck Institute | 2007 | Open, general-purpose | 17+ million entities, 150+ million facts | Research, NLP benchmarks |
| Freebase | Metaweb / Google | 2007 (closed 2016) | Open (historical) | ~3 billion facts | Powered early Google Knowledge Graph |
| ConceptNet | MIT Media Lab | 1999 (as OMCS) | Open, commonsense | 21 million edges, 8 million nodes | Commonsense reasoning, NLU |
| WordNet | Princeton University | 1985 | Lexical database | 117,000 synsets | Lexical semantics, NLP |
| NELL | Carnegie Mellon | 2010 | Continuous learning | 50+ million beliefs | Research on never-ending learning |
| Microsoft Concept Graph (Probase) | Microsoft Research | 2016 | Concept hierarchy | 5.4 million concepts, 12 million isA edges | Short-text understanding |
| OpenCyc | Cycorp | 2002 (closed 2017) | Open subset of Cyc | 239,000 concepts | Commonsense reasoning |
| Microsoft Academic Graph | Microsoft | 2015 (closed 2021) | Open (historical) | 250+ million publications | Academic search and analysis |
| UMLS | US National Library of Medicine | 1986 | Biomedical | 3.49 million concepts (2025AB) | Clinical NLP, biomedical research |
| DrugBank | University of Alberta | 2006 | Pharmaceutical | 500,000+ drug entries | Drug discovery, pharmacology |
| Bio2RDF | Open consortium | 2008 | Life sciences | 11 billion+ triples | Life sciences research |
| UniProt | UniProt Consortium | 2002 | Open, biomedical | 250+ million protein sequences | Protein research, bioinformatics |
| GeoNames | GeoNames | 2005 | Open, geographic | 12+ million place names | Geocoding, mapping |
WordNet, while strictly a lexical database rather than a knowledge graph, is included because it served as a backbone for many graphs (notably YAGO) and continues to provide upper-level taxonomy for entity types. Microsoft Concept Graph, built on the Probase research project, is unusual in that it captures the typicality of isA relations: it not only knows that a robin is a bird, but also that a robin is a more typical bird than a penguin, which helps with short-text understanding [13].
Building a knowledge graph requires extracting structured information from diverse sources. Several methods are typically combined.
The most direct method is manual curation: human experts review sources and create entities, relationships, and properties by hand. The approach produces high-quality, reliable data but is expensive and does not scale well. Wikidata relies heavily on volunteer contributors who add and verify statements, supported by automated bots that handle routine tasks [11]. As of 2025, Wikidata has more than 24,000 monthly contributors, a large community for a curated project but small relative to the volume of statements it maintains.
Natural language processing techniques automatically extract structured information from unstructured text. The pipeline typically involves several steps: named entity recognition to find entity mentions, entity linking (disambiguation) to map each mention to a graph identifier, coreference resolution to group mentions that refer to the same entity, and relation extraction to identify the relationships expressed between linked entities.
Open Information Extraction (OpenIE) is a related approach that extracts (subject, relation, object) tuples without requiring a fixed schema. The line of work began with Banko et al.'s TextRunner system in 2007 and continued through ReVerb (2011), OLLIE, and ClausIE [14]. OpenIE produces high recall but messy output, since the relation strings are arbitrary phrases from the source text rather than ontology-aligned predicates.
Mintz et al.'s 2009 distant supervision technique uses an existing knowledge graph (originally Freebase) as a source of automatic training labels. For any entity pair that appears in some known relation, sentences in a large corpus that mention both entities are treated as positive examples. Distant supervision is noisy but eliminates the bottleneck of manual annotation, and it has trained most modern relation extractors at scale [10].
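The labeling rule can be written in a few lines of Python; the toy fact table and corpus below are illustrative stand-ins for Freebase and a web-scale corpus.

```python
# Sketch of distant supervision in the spirit of Mintz et al. (2009): sentences
# from an (assumed, toy) corpus that mention both entities of a known KG fact are
# treated as positive training examples for that relation.
kg_facts = {("Albert Einstein", "Ulm"): "born_in",
            ("Albert Einstein", "Institute for Advanced Study"): "employer"}

corpus = [
    "Albert Einstein was born in Ulm in 1879.",
    "Albert Einstein joined the Institute for Advanced Study in 1933.",
    "Ulm lies on the Danube.",
]

training_examples = []
for (head, tail), relation in kg_facts.items():
    for sentence in corpus:
        if head in sentence and tail in sentence:
            # Weak label: assume the sentence expresses the KG relation.
            training_examples.append((sentence, head, tail, relation))

for example in training_examples:
    print(example)
```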
Crowdsourced graphs such as Wikidata, and formerly Freebase, rely on large communities of contributors. Crowdsourcing combines the scale of automated methods with human quality control, though it introduces challenges around contributor reliability, vandalism, and edit disputes. Wikidata uses community vetting, automated edit filters, and rollback bots to keep quality manageable.
Semi-structured data from websites (tables, infoboxes, product listings) can be systematically extracted and converted into triples. DBpedia's extraction of Wikipedia infoboxes is the canonical example, and wrapper induction systems learn extraction templates for whole classes of similarly formatted pages.
Enterprise knowledge graphs are often built by integrating data from relational databases, APIs, spreadsheets, and other structured sources. Schema mapping translates fields and tables from each source into the target ontology, and entity resolution (sometimes called record linkage) identifies when records in different sources refer to the same real-world entity. The challenge of merging "John Smith" in one database with "J. Smith" in another, without confusing him with a different John Smith, scales rapidly with the number of sources.
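A small Python sketch of the name-matching step, using only string similarity; a realistic entity resolution system would add contextual features and probabilistic scoring, as noted above.

```python
# Minimal sketch of name-based matching for entity resolution; a production
# system would add contextual features (employer, birth date) and a learned model.
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Crude surname-aware similarity between two person names."""
    a_parts, b_parts = a.lower().split(), b.lower().split()
    if a_parts[-1] != b_parts[-1]:   # different surnames: treat as a non-match
        return 0.0
    return SequenceMatcher(None, " ".join(a_parts), " ".join(b_parts)).ratio()

candidates = [("John Smith", "J. Smith"), ("John Smith", "Jane Smith"),
              ("John Smith", "John Smyth")]
for a, b in candidates:
    print(a, "~", b, round(name_similarity(a, b), 2))
```

The second pair shows why string similarity alone is insufficient: "Jane Smith" scores high despite being a different person, which is exactly where contextual features and human review come in.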
Large language models have changed the economics of knowledge graph construction. A modern pipeline can prompt an LLM to read a document, extract entity mentions, classify their types, and produce candidate relations in a single pass. Microsoft's GraphRAG indexer uses this approach end to end. Research throughout 2024 and 2025 has shown that LLMs can match or exceed earlier supervised models on entity and relation extraction, particularly for complex or domain-specific documents [15]. Hybrid pipelines that combine LLM extraction with human review now dominate domain knowledge graph construction.
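A sketch of the prompting pattern under stated assumptions: `call_llm` is a placeholder for whatever model client a team uses, and the prompt wording and JSON output schema are illustrative rather than any particular product's API.

```python
# Sketch of LLM-based triple extraction. `call_llm` is a placeholder for an
# actual model client; the prompt format and output schema are illustrative only.
import json

EXTRACTION_PROMPT = """Extract (subject, relation, object) triples from the text.
Return a JSON list of objects with keys "subject", "relation", "object".

Text: {text}"""

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def extract_triples(text: str) -> list[dict]:
    raw = call_llm(EXTRACTION_PROMPT.format(text=text))
    triples = json.loads(raw)
    # Hallucination guard: keep only triples whose subject and object
    # literally appear in the source text.
    return [t for t in triples
            if t["subject"] in text and t["object"] in text]
```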
Every knowledge graph is incomplete. Some facts are missing because no source has recorded them, others because the extraction pipeline failed, and others still because the world has changed. Knowledge graph completion is the task of predicting missing edges and properties, typically by training a model on the existing graph and using it to score candidate triples.
Link prediction is the most studied subproblem: given a head entity and a relation, predict the most likely tail entity (and vice versa). Standard benchmarks include FB15k and FB15k-237 (subsets of Freebase), WN18 and WN18RR (subsets of WordNet), and YAGO3-10. Triple classification, entity classification, and relation prediction round out the common tasks.
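A brief Python sketch of the standard evaluation metrics, mean reciprocal rank (MRR) and Hits@k, computed from the rank each model assigns to the true entity; the ranks below are made up for illustration.

```python
# Sketch of the standard link prediction metrics (MRR and Hits@k) given the rank
# of the correct entity for each test triple; the example ranks are illustrative.
def mrr(ranks: list[int]) -> float:
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks: list[int], k: int) -> float:
    return sum(1 for r in ranks if r <= k) / len(ranks)

ranks = [1, 3, 12, 2, 250]   # rank of the true entity among all candidates
print("MRR    :", round(mrr(ranks), 3))
print("Hits@10:", hits_at_k(ranks, 10))
```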
Three broad families of methods address completion: embedding-based approaches that project entities and relations into a vector space, rule-based approaches that learn logical patterns over the graph, and neural-symbolic hybrids that combine the two. Embedding methods dominate research output, but rule mining systems such as AMIE remain useful in production because their outputs are interpretable.
Knowledge graph embeddings are learned vector representations of entities and relationships in a continuous, low-dimensional space. They turn link prediction into vector arithmetic and let downstream models combine graph signals with text and image embeddings.
The field has produced a long list of embedding models. The table below covers the most influential.
| Model | Year | Family | Core idea |
|---|---|---|---|
| TransE | 2013 | Translational | Models a relation as a translation: h + r should equal t in the embedding space [16] |
| TransH | 2014 | Translational | Projects entities onto a relation-specific hyperplane before translating, handling 1-to-N and N-to-N relations |
| TransR | 2015 | Translational | Maintains separate spaces for entities and relations, with a per-relation projection matrix |
| TransD | 2015 | Translational | Uses dynamic mapping matrices that depend on both the entity and the relation |
| RESCAL | 2011 | Tensor factorization | Represents the graph as a 3-way tensor and factorizes it with full per-relation matrices |
| DistMult | 2015 | Bilinear | Restricts each per-relation matrix to a diagonal, scaling well but only handling symmetric relations [17] |
| HolE | 2016 | Holographic | Combines entity vectors with circular correlation, capturing more interactions than DistMult |
| ComplEx | 2016 | Complex-valued | Extends DistMult to complex-valued embeddings, capturing antisymmetric relations through the Hermitian dot product [18] |
| ConvE | 2018 | Convolutional | Reshapes head and relation embeddings into a 2D matrix, applies 2D convolution, and projects back; very parameter efficient [19] |
| RotatE | 2019 | Rotational | Models a relation as a rotation in complex space, capturing symmetry, antisymmetry, inversion, and composition [20] |
| R-GCN | 2018 | Graph neural network | Uses a graph neural network with per-relation weights to aggregate neighbor features [21] |
| CompGCN | 2020 | Graph neural network | Composes entity and relation embeddings during message passing |
| KGAT | 2019 | Attention | Applies attention over the neighborhood for recommendation tasks |
| QuatE | 2019 | Quaternion | Generalizes RotatE to quaternion space for richer rotations |
| TuckER | 2019 | Tensor factorization | Uses a Tucker decomposition of the binary tensor, achieving strong link prediction scores |
TransE, introduced by Antoine Bordes and colleagues at NeurIPS 2013, is the foundational model [16]. The idea is striking in its simplicity: if (h, r, t) is a fact, then the embedding vectors should satisfy h + r ≈ t. Training minimizes the distance between h + r and t for true triples and maximizes it for corrupted triples produced by replacing the head or tail. TransE is fast and scales to graphs with millions of entities, and it remains a standard baseline. Its main limitation is that it cannot represent one-to-many or many-to-many relations cleanly, since translation is a one-to-one operation.
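A minimal numpy sketch of the TransE scoring function and margin loss, under the simplifying assumptions of random embeddings and a single corrupted triple; it is meant to show the arithmetic, not to serve as a training implementation.

```python
# Minimal numpy sketch of TransE's scoring and margin loss; real implementations
# add mini-batching, entity-vector normalization, and gradient-based training.
import numpy as np

rng = np.random.default_rng(0)
dim, n_entities, n_relations = 50, 1000, 20
E = rng.normal(size=(n_entities, dim))   # entity embeddings
R = rng.normal(size=(n_relations, dim))  # relation embeddings

def score(h: int, r: int, t: int) -> float:
    """TransE distance ||h + r - t||: smaller means more plausible."""
    return float(np.linalg.norm(E[h] + R[r] - E[t]))

def margin_loss(pos, neg, margin: float = 1.0) -> float:
    """Hinge loss pushing true triples at least `margin` closer than corrupted ones."""
    return max(0.0, margin + score(*pos) - score(*neg))

pos = (0, 0, 1)                          # a (head, relation, tail) fact
neg = (0, 0, rng.integers(n_entities))   # same fact with a corrupted tail
print(margin_loss(pos, neg))
```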
The Trans family of models (TransH, TransR, TransD) addressed this by projecting entities into relation-specific subspaces. The bilinear family started with RESCAL in 2011 and produced DistMult and ComplEx, with ComplEx adding the imaginary component needed to model antisymmetric relations such as parentOf [18]. RotatE in 2019 unified many of these properties by modeling each relation as a rotation in a complex plane, drawing on Euler's identity and supporting symmetry, antisymmetry, inversion, and composition simultaneously [20].
The more recent generation uses graph neural networks. R-GCN introduced relation-specific convolutions over the graph, allowing entity representations to depend on multi-hop neighborhoods [21]. CompGCN, KGAT, and other extensions further refined the message passing scheme. As of 2025, GNN-based models are the strongest performers on standard link prediction benchmarks, with embedding plus inductive reasoning systems such as NBFNet and AnyBURL closing the gap on harder generalization tests.
Knowledge graph embeddings support more than completion. They feed into downstream classifiers, recommendation systems, and entity resolution, and they act as priors in question answering systems that first retrieve candidate entities by embedding similarity before refining the answer with explicit graph queries. In drug discovery, embedding-based scoring has been used to rank candidate gene-disease associations and predict adverse drug-drug interactions.
A knowledge graph is only useful if it can be queried efficiently. Three query languages dominate practice.
SPARQL is the W3C standard for RDF graphs, recommended in 2008 and updated to SPARQL 1.1 in 2013 [4]. A SPARQL query consists of a graph pattern with variable placeholders. Solutions are produced by binding variables to graph terms in a way that satisfies the pattern. SPARQL supports unions, optional patterns, filters, aggregations, federated queries across multiple endpoints, and updates through SPARQL Update. Wikidata, DBpedia, and most public RDF endpoints support SPARQL.
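As an illustration, the following Python sketch queries the public Wikidata endpoint for Einstein's occupations (P106), assuming the third-party SPARQLWrapper package; endpoint rate limits and user-agent requirements may apply.

```python
# Sketch of a SPARQL query against the public Wikidata endpoint, assuming the
# SPARQLWrapper package (`pip install sparqlwrapper`).
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
SELECT ?occupationLabel WHERE {
  wd:Q937 wdt:P106 ?occupation .   # Albert Einstein's occupations
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                       agent="kg-article-example/0.1")
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["occupationLabel"]["value"])
```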
Cypher is the query language of Neo4j, the leading property graph database. Cypher uses an ASCII-art notation in which nodes are written as (node) and edges as -[edge]->, which makes patterns visually intuitive. Cypher has become the de facto standard for property graphs and forms the basis of the openCypher specification. The new ISO GQL standard, finalized in 2024, draws heavily on Cypher syntax.
Gremlin is a graph traversal language from Apache TinkerPop. Where SPARQL and Cypher are declarative, Gremlin is imperative: a query is a sequence of steps that walk the graph (such as g.V().has('name','Einstein').out('bornIn')). Gremlin runs on top of any TinkerPop-compliant database, including JanusGraph, Amazon Neptune, and Cosmos DB.
Knowledge graphs are stored in two main classes of database, with growing convergence between them.
Triple stores natively store RDF triples and provide SPARQL endpoints. Apache Jena (with the TDB and Fuseki components), OpenLink Virtuoso, Ontotext GraphDB, Stardog, AllegroGraph, and Blazegraph (the engine that powered the Wikidata Query Service for years) are the major implementations. Triple stores excel at standards-compliance, federated queries, and OWL reasoning. Their performance on graph traversal queries can lag specialized graph databases, although the gap has narrowed.
Property graph databases store nodes and edges with attached key-value properties. Neo4j is the market leader and the original property graph database. Amazon Neptune supports both RDF and property graphs through SPARQL and Gremlin. TigerGraph emphasizes high-throughput parallel processing for very large graphs. ArangoDB combines property graphs with documents and key-value storage. Memgraph offers an in-memory option focused on real-time analytics. JanusGraph, the open source successor to Titan, runs on top of distributed back ends like Cassandra and HBase.
A newer category, sometimes called graph virtualization, exposes a graph layer over existing relational or columnar stores; PuppyGraph and Apache AGE (Postgres-based) are recent examples. Vector indexes such as Pinecone, Weaviate, and Qdrant are increasingly paired with graph databases in hybrid architectures, since semantic similarity and structured traversal answer different kinds of questions.
Reasoning over a knowledge graph means inferring new facts, checking consistency, or answering complex questions that require chaining together multiple stored facts. Three families of reasoning are commonly distinguished.
Logical inference applies the deductive rules expressed in an ontology. OWL reasoners such as Pellet, HermiT, and FaCT++ derive new triples from class hierarchies, property characteristics, and cardinality constraints. If the ontology says every CEO is an Executive and the graph says Tim Cook is the CEO of Apple, an OWL reasoner concludes that Tim Cook is an Executive. Logical reasoning is sound and explainable but does not handle exceptions, uncertainty, or missing data well.
Path-based reasoning uses random walks or learned path scoring to predict relations. The Path Ranking Algorithm (PRA), introduced by Lao and Cohen in 2010, learns logistic regression weights over paths between entities. Subsequent systems such as DeepPath and MINERVA used reinforcement learning to walk the graph, while AMIE and AnyBURL mine logical rules in the spirit of inductive logic programming.
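The following toy Python sketch illustrates the path-feature idea behind PRA on a three-edge graph; the graph is illustrative, and real systems enumerate and weight such paths over millions of edges.

```python
# Toy sketch of the path-feature idea behind PRA: enumerate labeled relation
# paths between two entities; path counts then become features for a classifier.
from collections import deque

# (head, relation, tail) edges of a tiny illustrative graph
edges = [("Einstein", "born_in", "Ulm"),
         ("Ulm", "located_in", "Germany"),
         ("Einstein", "citizen_of", "Germany")]

adj = {}
for h, r, t in edges:
    adj.setdefault(h, []).append((r, t))

def relation_paths(start: str, goal: str, max_len: int = 3):
    """Yield sequences of relation labels connecting start to goal."""
    queue = deque([(start, ())])
    while queue:
        node, path = queue.popleft()
        if node == goal and path:
            yield path
        if len(path) < max_len:
            for r, nxt in adj.get(node, []):
                queue.append((nxt, path + (r,)))

print(list(relation_paths("Einstein", "Germany")))
# [('citizen_of',), ('born_in', 'located_in')]
```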
Embedding-based reasoning, as covered in the embeddings section, scores candidate triples in a learned vector space. It generalizes well across the graph and handles uncertainty naturally, but its outputs are not interpretable in the way that explicit rules are.
Neural-symbolic methods aim to combine the strengths of all three. Models such as Neural Theorem Provers, ExpressGNN, and the Logic Tensor Networks family use neural networks to learn embeddings while constraining them to satisfy symbolic rules. The 2024 to 2026 wave of research on KG plus LLM systems, where the LLM provides commonsense and natural language understanding while the graph provides factual grounding, is the latest entry in this long-running effort.
The table below summarizes the major application domains for knowledge graphs.
| Domain | Example use | Representative systems |
|---|---|---|
| Web search | Knowledge Panels, entity-linked search results, structured snippets | Google Knowledge Graph, Bing Satori |
| Question answering | Translating natural language to SPARQL, multi-hop reasoning | IBM Watson, Wikidata Query Service, KGQA systems |
| Virtual assistants | Entity-grounded answers, slot filling, follow-up resolution | Google Assistant, Siri, Alexa |
| Recommendation | Knowledge-aware collaborative filtering, cold-start recommendation | Pinterest's Pinnerverse, Spotify, Amazon |
| Drug discovery | Drug repurposing, target identification, adverse interaction prediction | BenevolentAI, Insilico Medicine, Hetionet |
| Fraud detection | Transaction graphs, shared-device detection, fake account rings | PayPal, banks, AML platforms |
| Cybersecurity | Threat actor profiling, indicator-of-compromise correlation | MITRE ATT&CK, AlienVault OTX |
| Enterprise search | Unified search across documents, code, and structured systems | LinkedIn, Bloomberg, Microsoft Graph |
| Compliance and legal | Statute and case law graphs, contract analysis | Westlaw Edge, LexisNexis |
| Scientific research | Literature graphs, citation networks, materials and biology graphs | OpenAlex, Semantic Scholar, Materials Project |
| Supply chain | Supplier networks, parts traceability, risk propagation | Manufacturing ERP, logistics platforms |
| Generative AI grounding | RAG over structured data, hallucination reduction | GraphRAG, LightRAG, enterprise KG-RAG systems |
Google's Knowledge Graph remains the most visible example. When a user searches for Albert Einstein, the Knowledge Panel showing his birth date, notable works, and related people is powered by the Knowledge Graph. The structured representation lets the search engine understand queries at the entity level rather than relying solely on keyword matching, and it lets results be syndicated into voice answers, image carousels, and Maps.
Knowledge graph question answering (KGQA) systems translate natural language questions into structured queries (typically SPARQL) that are executed against a knowledge graph. The question "Who founded the company that made the iPhone?" requires traversing two relationships: iPhone -> manufacturer -> Apple, and Apple -> founder -> Steve Jobs. Knowledge graphs supply the structured data that makes multi-hop reasoning possible. Modern KGQA pipelines combine LLM-based query generation with graph retrieval and answer ranking.
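A toy Python sketch of the two-hop traversal behind that question; the in-memory graph and the pre-parsed query plan stand in for a real KGQA pipeline.

```python
# Toy sketch of the two-hop traversal for "Who founded the company that made the
# iPhone?"; the graph and the hard-coded query plan are illustrative only.
graph = {
    ("iPhone", "manufacturer"): ["Apple"],
    ("Apple", "founder"): ["Steve Jobs", "Steve Wozniak", "Ronald Wayne"],
}

def follow(entities: list[str], relation: str) -> list[str]:
    """One hop: collect objects of `relation` for every input entity."""
    return [obj for e in entities for obj in graph.get((e, relation), [])]

# hop 1: manufacturer of the iPhone, hop 2: founders of that company
answers = follow(follow(["iPhone"], "manufacturer"), "founder")
print(answers)  # ['Steve Jobs', 'Steve Wozniak', 'Ronald Wayne']
```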
E-commerce platforms and streaming services use knowledge graphs to model relationships between products, users, genres, and attributes. Knowledge graph-based recommendations can explain why an item was recommended ("Because you liked science fiction films directed by Denis Villeneuve") and address the cold-start problem by leveraging entity attributes even when user interaction data is sparse. Pinterest's Pinnerverse and similar systems combine knowledge graph signals with embedding-based collaborative filtering.
Pharmaceutical companies use biomedical knowledge graphs to identify potential drug targets, predict drug interactions, and repurpose existing drugs for new conditions. By modeling relationships between genes, proteins, diseases, pathways, and chemical compounds, researchers can computationally explore hypotheses that would take years to test experimentally. Hetionet, BenevolentAI, and Insilico Medicine all build on this approach, and several COVID-19 repurposing studies used biomedical knowledge graphs to surface candidate leads within weeks.
Financial institutions use knowledge graphs to detect fraud by modeling relationships between accounts, transactions, devices, and individuals. Suspicious patterns become visible when represented as a graph: circular money transfers, shared devices across seemingly unrelated accounts, or rapid changes in corporate ownership. PayPal, banks, and anti-money-laundering platforms run graph algorithms to score new transactions and flag rings of synthetic identities.
LinkedIn's Economic Graph and similar enterprise graphs unify members, jobs, companies, and skills into a single queryable structure that powers search, recommendations, and analytics. Bloomberg's knowledge graph integrates financial, news, and reference data. Internally, large companies use enterprise graphs as the substrate for AI applications because the graph captures relationships that documents and tables alone do not.
The interaction between knowledge graphs and LLMs has become the most active area of knowledge graph research. The relationship runs in both directions.
LLMs have lowered the cost of every step in the construction pipeline. Named entity recognition, entity linking, relation extraction, and ontology mapping can all be done with prompting and few-shot examples, often without task-specific fine-tuning. The trade-off is that LLMs are slower and more expensive per document than dedicated extractors, and they can hallucinate entities or relations that do not exist in the source. In practice, teams use hybrid pipelines: LLMs for the hard cases and rare relations, supervised models for the bulk of routine extraction, and human review for the most consequential additions.
LLMs are unreliable on factual recall, especially for long-tail entities and recent events. Grounding generation in a knowledge graph reduces hallucination by retrieving facts at query time and prompting the model to use only those facts. The retrieved subgraph also provides explicit citation paths for explainability and compliance. Studies in 2024 and 2025 show that KG-grounded LLM systems outperform pure RAG and pure LLM baselines on enterprise question answering, especially for multi-hop questions.
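A minimal sketch of the grounding pattern in Python: the retrieved triples and the prompt wording are illustrative, and the model call itself is omitted.

```python
# Sketch of grounding an LLM prompt in retrieved graph facts; the triples and
# prompt template are illustrative, and the actual model call is left out.
retrieved_triples = [
    ("Albert Einstein", "employer", "Institute for Advanced Study"),
    ("Albert Einstein", "award received", "Nobel Prize in Physics"),
]

facts = "\n".join(f"- {h} | {r} | {t}" for h, r, t in retrieved_triples)

prompt = (
    "Answer the question using ONLY the facts below. "
    "Cite the fact you used for each claim.\n\n"
    f"Facts:\n{facts}\n\n"
    "Question: Where did Albert Einstein work after emigrating to the United States?"
)
print(prompt)
```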
GraphRAG, introduced by Darren Edge, Ha Trinh, and colleagues at Microsoft Research in the April 2024 paper "From Local to Global: A Graph RAG Approach to Query-Focused Summarization", is the most influential KG-RAG framework [22]. The Microsoft pipeline runs in four stages: chunking source documents into text units, using an LLM to extract entities and relationships from each unit, clustering the resulting graph into hierarchical communities with the Leiden algorithm, and generating LLM-written summaries of each community that are drawn on at query time.
GraphRAG addresses a fundamental limitation of standard retrieval-augmented generation: the inability to answer global questions that require synthesizing information across an entire corpus. A question like "What are the main themes in this dataset?" cannot be answered by retrieving a handful of similar text chunks. The community summaries and graph structure give the system a way to reason across the full breadth of the data. Microsoft's benchmarks showed substantial improvements over baseline RAG on comprehensiveness and diversity for global sensemaking questions [22].
A significant limitation of GraphRAG is its computational cost: the indexing phase requires multiple LLM calls per document. LightRAG, introduced by Zirui Guo and colleagues in October 2024, addresses this with a dual-level retrieval system that achieves comparable accuracy with roughly an order of magnitude fewer tokens [23]. LightRAG retrieves entities and relationships rather than full text chunks, uses a low-level path for narrow entity-specific queries and a high-level path for broad thematic queries, and incorporates incremental updates so that adding new documents does not require rebuilding the entire index.
Other variants in the same lineage include FastGraphRAG, MiniRAG, and HippoRAG (which draws inspiration from the human hippocampus's role in memory consolidation). The collective effect has been to push KG-RAG from a research curiosity to a production technique with clear cost and quality trade-offs against vector-only RAG.
By 2025, hybrid architectures combining vector search with knowledge graphs had become the recommended approach for enterprise generative AI. Vector search excels at semantic similarity and fuzzy matching, while graph traversal excels at multi-hop, entity-centric reasoning and structured constraints. LinkedIn reported that adding a knowledge graph layer to its customer support assistant reduced ticket resolution time from 40 hours to 15 hours, a 63% improvement [2]. Public case studies in finance, legal, and healthcare have reported similar gains in answer accuracy and citation quality.
A newer line of work uses LLMs to plan and execute multi-step graph queries. Frameworks such as Think-on-Graph and KG-LLM use the LLM to decompose a question into subqueries, traverse the graph step by step, and assemble the final answer with explicit reasoning paths. The 2025 Knowledge Graph Language for LLMs (KGL-LLM) showed that real-time KG context reduced completion errors and improved factual accuracy across several benchmarks [15].
Knowledge graphs are inherently incomplete. No graph captures every fact about every entity, and missing information can lead to incorrect inferences. If a biomedical graph lacks the fact that a certain drug interacts with a particular medication, a clinical decision support system built on that graph may fail to flag a dangerous combination. The open-world assumption used in RDF means that the absence of a fact does not imply its falsehood, which is logically conservative but operationally awkward.
As knowledge graphs grow to billions of triples, maintaining query performance becomes difficult. Multi-hop traversal queries can be computationally expensive, and indexing strategies that work for millions of triples may not scale to billions. Distributed graph databases help but introduce complexity around data partitioning and consistency. The Wikidata Query Service has had documented latency issues at scale, and several large public endpoints have been split into separate shards or scopes.
Automated extraction is noisy. NELL famously had to be retrained periodically to remove drifted concepts. DBpedia and Wikidata both contain known errors, vandalism, and inconsistencies that the community works to fix. Even Google's Knowledge Graph has propagated incorrect facts to millions of users, including misattributed deaths and confused identity merges.
Determining when two records refer to the same real-world entity is a persistent challenge. The entity John Smith in one data source may or may not be the same person as J. Smith in another. Entity resolution at scale requires matching algorithms that combine name similarity, contextual features, and probabilistic reasoning, and it almost always benefits from human oversight on the harder cases.
As the domain evolves, the ontology underlying a knowledge graph must evolve too. Adding new entity types, relationships, or constraints without breaking existing queries and applications requires careful schema management. The challenge is acute for enterprise graphs that span dozens of teams and applications.
Facts change over time. CEOs change, countries are renamed, prices fluctuate, and scientific understanding evolves. Keeping a knowledge graph current requires update mechanisms that are especially hard for facts extracted from static documents. Time-aware knowledge graphs and event-based update pipelines have been proposed but are not yet standard.
Knowledge graphs inherit biases from their source data. If the sources overrepresent certain demographics, geographies, or perspectives, the graph will too. Research has documented systematic biases in public knowledge graphs, including gender imbalances in Wikidata's biography coverage and geographic skew toward English-speaking countries and Western Europe [24]. Bias propagates downstream when the graph is used to train embedding models or to ground LLM responses.
The basic triple model treats every fact as binary: either the graph asserts it or not. Real facts come with uncertainty, source attribution, and temporal scope. Wikidata addresses this with statement qualifiers and references; RDF reification supports it less elegantly. Probabilistic knowledge graphs and approaches like Markov Logic Networks have been proposed but remain niche.
Knowledge graphs are in a stronger position than they were a few years ago, largely because the rise of generative AI has made structured knowledge useful in new ways.
Hybrid retrieval, combining vector search with graph traversal, has become the default architecture for enterprise generative AI. Vendors have moved quickly: most major graph databases now expose vector indexes, and most major vector databases expose graph traversal primitives. The line between the two categories has blurred to the point that observers describe a converged AI database market.
Multimodal knowledge graphs are emerging that incorporate not only text-based facts but also images, audio, and video. Google's Knowledge Graph is used to keep multimodal AI experiences (text, image, video, voice) consistent, and other research links knowledge graph entities to embeddings derived from CLIP, audio encoders, and video models for cross-modal search and grounding.
Open standards coexist with property graph models. The W3C's RDF and SPARQL standards still underpin much of the public knowledge graph ecosystem, including Wikidata, DBpedia, and most life sciences graphs. Property graphs (Neo4j, TigerGraph, others) dominate the enterprise market. The ISO GQL standard, finalized in 2024 and based heavily on Cypher, gives the property graph world its first international standard query language.
The graph database market continues to expand, with Neo4j, Amazon Neptune, TigerGraph, ArangoDB, and Memgraph among the leading platforms and newer entrants such as PuppyGraph carving out specialty niches. Triple store vendors such as Stardog, GraphDB, and AllegroGraph have repositioned around AI grounding and enterprise data fabric use cases.
Research interest has rebounded sharply. A July 2025 survey cataloged dozens of GraphRAG publications spanning healthcare, finance, legal, and software engineering domains [25], and major NLP and AI conferences have reinstated dedicated knowledge graph tracks. The broad bet is that grounded AI will not be pure neural and not pure symbolic but a layered system in which knowledge graphs supply the structured truth that LLMs reason over.