A knowledge graph is a structured representation of real-world entities and the relationships between them, organized as a network of interconnected nodes and edges. Each node represents an entity (such as a person, place, organization, or concept), while each edge represents a relationship between two entities (such as "born in," "founded by," or "is a type of"). Knowledge graphs store facts in a machine-readable format that enables both humans and software systems to query, reason over, and derive new insights from large bodies of information.
The term gained widespread recognition in 2012 when Google introduced its Knowledge Graph to enhance search results with structured information panels [1]. Since then, knowledge graphs have become foundational infrastructure for artificial intelligence applications ranging from search engines and recommendation systems to drug discovery and fraud detection. As of 2026, Gartner identifies knowledge graphs as a "Critical Enabler" with immediate impact on generative AI systems [2].
The ideas underlying knowledge graphs stretch back decades. In the 1960s and 1970s, researchers in artificial intelligence developed semantic networks, graph structures that represented concepts and their relationships. These early systems, including Ross Quillian's semantic memory model (1968) and Marvin Minsky's frames (1975), attempted to encode human knowledge in formats that machines could process.
The Semantic Web initiative, proposed by Tim Berners-Lee in 2001, formalized many of these ideas. The Resource Description Framework (RDF), published as a W3C recommendation in 1999 and revised in 2004, provided a standard data model for representing information as subject-predicate-object triples. The Web Ontology Language (OWL), released in 2004, added richer vocabulary for defining classes, properties, and logical constraints.
Two landmark projects launched in 2007 brought large-scale, publicly available knowledge graphs into existence.
DBpedia extracted structured information from Wikipedia infoboxes and converted it into RDF triples. By parsing the semi-structured data in Wikipedia's infoboxes, DBpedia created a queryable graph of millions of entities, making the implicit structure of Wikipedia explicit and machine-readable [3].
Freebase, created by Metaweb Technologies, took a different approach: it invited users to contribute and curate facts about real-world entities in a collaborative, open knowledge base. Google acquired Metaweb in 2010, gaining access to Freebase's rich entity data [4].
On May 16, 2012, Google announced the Google Knowledge Graph, a system that enhanced search results by displaying structured information panels (known as Knowledge Panels) alongside traditional search results [1]. The system drew on multiple data sources including Freebase, Wikipedia, and the CIA World Factbook.
The impact was immediate. Google's tagline for the launch, "things, not strings," captured the shift from keyword matching to entity understanding. Within seven months, the Knowledge Graph tripled in size, covering 570 million entities and 18 billion facts. By mid-2016, Google reported that the Knowledge Graph contained 70 billion facts and answered roughly one-third of the 100 billion monthly searches it handled [1].
Shortly after Google's announcement, the Wikimedia Foundation launched Wikidata on October 29, 2012 [5]. Wikidata serves as a free, open, multilingual knowledge base that any person or machine can edit. Unlike Freebase, which Google eventually closed in 2015 (migrating its content to Wikidata), Wikidata has continued to grow as a community-driven resource. As of 2025, Wikidata contains over 115 million data items and approximately 1.65 billion statements [5].
Other notable knowledge graphs emerged during this period. YAGO (Yet Another Great Ontology), first released in 2007 by the Max Planck Institute for Informatics, combined information from Wikipedia, WordNet, and GeoNames. YAGO 4.5, the most recent version, contains over 17 million entities and more than 150 million facts organized under a clean taxonomy based on schema.org [6].
ConceptNet, originating from the MIT Open Mind Common Sense project launched in 1999, focused on commonsense knowledge rather than encyclopedic facts. It captures everyday knowledge such as "a dog is a pet" and "rain makes the ground wet," making it valuable for natural language understanding tasks [7].
Knowledge graphs represent information using a graph data model. The core building blocks are entities, relationships, and properties.
Entities are the fundamental objects in a knowledge graph. Each entity represents a distinct real-world thing: a person, an organization, a location, a product, or an abstract concept. Every entity is assigned a unique identifier. In Wikidata, for example, Albert Einstein is identified as Q937, while the concept of "physicist" is Q169470.
Relationships connect pairs of entities and describe how they are related. A relationship is a directed, labeled edge in the graph. For example, the relationship "Albert Einstein (Q937) -- occupation --> Physicist (Q169470)" connects the entity for Einstein to the entity for physicist through the "occupation" relationship.
The fundamental unit of information in a knowledge graph is the triple, also called a statement, consisting of three parts: subject, predicate, and object. For example, the fact "Albert Einstein's occupation is physicist" is stored as the triple (Albert Einstein, occupation, Physicist).
Triples follow the RDF standard (subject-predicate-object), and they can be stored in specialized databases called triple stores that support the SPARQL query language for graph queries.
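The triple-and-pattern model can be illustrated with a minimal in-memory store. This is a toy sketch, not a real triple store; production systems use dedicated databases queried via SPARQL, and libraries such as rdflib provide full RDF support. The Wikidata identifiers follow the examples above.

```python
class TripleStore:
    """Minimal in-memory store holding (subject, predicate, object) facts."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return triples matching the pattern; None acts as a wildcard,
        analogous to a variable in a SPARQL basic graph pattern."""
        return [
            (s, p, o) for (s, p, o) in self.triples
            if (subject is None or s == subject)
            and (predicate is None or p == predicate)
            and (obj is None or o == obj)
        ]

store = TripleStore()
store.add("Q937", "occupation", "Q169470")    # Albert Einstein -- occupation --> physicist
store.add("Q937", "birthDate", "1879-03-14")  # literal-valued property
store.add("Q169470", "label", "physicist")

# "What is Q937's occupation?" -- roughly the SPARQL pattern
#   SELECT ?o WHERE { Q937 occupation ?o }
print(store.match(subject="Q937", predicate="occupation"))  # [('Q937', 'occupation', 'Q169470')]
```

The wildcard-based `match` mirrors how SPARQL basic graph patterns bind variables against stored triples.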
Entities can also have properties: attributes that describe characteristics of the entity itself rather than its relationships to other entities. For example, Albert Einstein has properties like birthDate (March 14, 1879) and deathDate (April 18, 1955). Properties attach literal values (strings, numbers, dates) to entities.
An ontology provides the conceptual schema for a knowledge graph: it defines the types of entities, the types of relationships allowed between them, and the constraints that govern the graph's structure. Ontologies are typically written in OWL or RDFS (RDF Schema).
For example, an ontology might specify that a "Person" entity can have an "occupation" relationship pointing to a "Profession" entity, but not to a "City" entity. This schema enforcement ensures data consistency and enables automated reasoning. If the ontology states that "every University is an Organization" and "MIT is a University," a reasoning engine can infer that "MIT is an Organization" without this fact being explicitly stored.
The distinction matters: an ontology supplies the conceptual framework and semantic consistency rules, while the knowledge graph itself holds the actual instance data about real-world entities [8].
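The University/Organization inference described above amounts to a transitive-closure computation over an is-a hierarchy. The sketch below is a minimal illustration of that idea; real reasoners (e.g., OWL engines) handle far richer logic than subclass chains.

```python
def infer_types(instance_of, subclass_of):
    """Compute every type of each instance by following subclass-of edges
    transitively (simple forward chaining over an is-a hierarchy)."""
    inferred = {}
    for entity, cls in instance_of.items():
        types = set()
        stack = [cls]
        while stack:
            current = stack.pop()
            if current not in types:
                types.add(current)
                stack.extend(subclass_of.get(current, []))
        inferred[entity] = types
    return inferred

# Schema: every University is an Organization.
subclass_of = {"University": ["Organization"]}
# Instance data: MIT is a University.
instance_of = {"MIT": "University"}

types = infer_types(instance_of, subclass_of)
print(types["MIT"])  # "Organization" is inferred, never explicitly stored
```

The fact "MIT is an Organization" is never asserted; it falls out of the schema plus the instance data, which is exactly the kind of automated reasoning an ontology enables.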
Knowledge graphs can be classified by their scope and purpose.
These aim to represent broad world knowledge across many domains. Google Knowledge Graph, Wikidata, and DBpedia are prominent examples. They cover millions of entities spanning people, places, organizations, events, scientific concepts, and more. General-purpose knowledge graphs are valuable for search engines, virtual assistants, and any application that needs broad factual grounding.
Domain-specific graphs focus on a particular field and capture specialized knowledge that general-purpose graphs typically lack. Examples include biomedical knowledge graphs such as UniProt, which models proteins and their functions, and academic graphs such as the Microsoft Academic Graph, which modeled publications, authors, and venues.
Organizations build internal knowledge graphs that integrate data from databases, CRM systems, product catalogs, internal documents, and other proprietary sources. These graphs provide a unified view of organizational knowledge and power internal search, analytics, and AI applications. Companies like LinkedIn, Airbnb, and Amazon have built extensive enterprise knowledge graphs to support their products [9].
The following table summarizes the most prominent knowledge graphs as of 2026.
| Knowledge Graph | Creator | Launch Year | Type | Scale | Primary Use |
|---|---|---|---|---|---|
| Google Knowledge Graph | Google | 2012 | Proprietary, general-purpose | 70+ billion facts (as of 2016) | Search enhancement, Google Assistant, Google Maps |
| Wikidata | Wikimedia Foundation | 2012 | Open, general-purpose | 115+ million items, 1.65 billion statements | Wikipedia support, open data, research |
| DBpedia | Leipzig University, Mannheim University | 2007 | Open, general-purpose | ~6 million entities (English) | Semantic Web, linked data research |
| YAGO | Max Planck Institute | 2007 | Open, general-purpose | 17+ million entities, 150+ million facts | Research, NLP benchmarks |
| ConceptNet | MIT Media Lab | 1999 (as OMCS) | Open, commonsense | 21 million edges, 8 million nodes | Commonsense reasoning, NLU |
| Microsoft Academic Graph | Microsoft | 2015 | Open (discontinued 2021) | 250+ million publications | Academic search and analysis |
| UniProt | UniProt Consortium | 2002 | Open, biomedical | 250+ million protein sequences | Protein research, bioinformatics |
Building a knowledge graph requires extracting structured information from diverse sources. Several methods are used, often in combination.
Human experts review sources and manually create entities, relationships, and properties. This approach produces high-quality, reliable data but is expensive and does not scale well. Wikidata relies heavily on volunteer contributors who manually add and verify statements, supported by automated bots that handle routine tasks [5].
Natural language processing techniques automatically extract structured information from unstructured text. The pipeline typically involves several steps: named entity recognition (detecting entity mentions in text), entity linking (mapping each mention to a canonical entity in the graph), coreference resolution (grouping mentions that refer to the same entity), and relation extraction (identifying the relationships expressed between entities).
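As a rough illustration of the extraction idea, a toy pattern-based extractor is shown below. The hand-written regular expressions stand in for the trained models or LLMs used in practice, and the example sentences are hypothetical.

```python
import re

def extract_triples(text):
    """Toy relation extraction: match 'X was born in Y' and 'X founded Y'
    patterns. Real pipelines use trained NER and relation-extraction models."""
    patterns = [
        (r"(?P<subj>[A-Z][\w ]+?) was born in (?P<obj>[A-Z][\w ]+)", "bornIn"),
        (r"(?P<subj>[A-Z][\w ]+?) founded (?P<obj>[A-Z][\w ]+)", "founded"),
    ]
    triples = []
    for pattern, predicate in patterns:
        for m in re.finditer(pattern, text):
            triples.append((m.group("subj").strip(), predicate, m.group("obj").strip()))
    return triples

text = "Albert Einstein was born in Ulm. Steve Jobs founded Apple."
print(extract_triples(text))
# [('Albert Einstein', 'bornIn', 'Ulm'), ('Steve Jobs', 'founded', 'Apple')]
```

Pattern-based extraction like this is brittle (it misses paraphrases such as "Apple, founded by Steve Jobs, ..."), which is precisely the gap that supervised models and LLMs close.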
Large language models have significantly improved these extraction tasks. Research from 2025 demonstrates that LLMs can perform entity and relation extraction with substantially higher accuracy than earlier supervised models, particularly when processing complex or ambiguous text [10].
Platforms like Wikidata and the former Freebase rely on large communities of contributors to build and maintain their graphs. Crowdsourcing combines the scale of automated methods with a degree of human quality control, though it introduces challenges around contributor reliability and vandalism.
Semi-structured data from websites (tables, infoboxes, product listings) can be systematically extracted and converted into triples. DBpedia's extraction of Wikipedia infoboxes is a well-known example of this approach.
Enterprise knowledge graphs are often built by integrating data from relational databases, APIs, spreadsheets, and other structured sources. Schema mapping and entity resolution (identifying when records in different databases refer to the same real-world entity) are key challenges in this process.
Knowledge graph embeddings are learned vector representations of entities and relationships in a continuous, low-dimensional space. These embeddings enable mathematical operations over graph elements, supporting tasks like link prediction (predicting missing relationships), entity classification, and knowledge graph completion.
| Model | Year | Approach | Core Idea |
|---|---|---|---|
| TransE | 2013 | Translational | Represents relationships as translations: head + relation should equal tail in vector space |
| TransR | 2015 | Translational | Extends TransE by projecting entities into relation-specific spaces |
| DistMult | 2015 | Bilinear | Uses a diagonal matrix for each relation; scores triples via bilinear product |
| ComplEx | 2016 | Complex-valued | Extends DistMult to complex-valued embeddings, capturing asymmetric relations |
| RotatE | 2019 | Rotational | Models relations as rotations in complex space |
| ConvE | 2018 | Convolutional | Applies 2D convolution over reshaped entity and relation embeddings |
TransE, introduced by Bordes et al. in 2013, is the foundational model [11]. Its simplicity and scalability have made it a popular baseline, though it struggles with one-to-many and many-to-many relationships. Later models like RotatE and ComplEx address these limitations by using more expressive mathematical frameworks.
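The TransE scoring rule is simple enough to show directly: a triple (h, r, t) is scored by the negative distance ||h + r - t||, so scores near zero indicate plausible triples. The three-dimensional embeddings below are hand-picked, hypothetical values for illustration; real models learn hundreds of dimensions from data.

```python
import math

def transe_score(head, relation, tail):
    """TransE plausibility score: negative L2 distance ||h + r - t||.
    A score closer to zero means the triple is more plausible."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))

# Hypothetical embeddings chosen so that einstein + occupation ~= physicist.
einstein   = [0.9, 0.1, 0.3]
occupation = [0.1, 0.4, -0.2]
physicist  = [1.0, 0.5, 0.1]
berlin     = [-0.5, 0.8, 0.9]

# The true triple scores near zero; the corrupted triple scores much lower.
print(transe_score(einstein, occupation, physicist))
print(transe_score(einstein, occupation, berlin))
```

Training adjusts the embeddings so that observed triples score near zero and corrupted (negative-sampled) triples score lower, which is what makes link prediction possible: candidate tails can simply be ranked by score.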
More recent work (2023 onward) integrates graph neural networks (GNNs) and attention-based architectures to capture complex multi-hop interactions. Contrastive learning techniques have also been applied to improve embedding quality, particularly for large-scale graphs [10].
Knowledge graphs serve as a critical infrastructure layer for numerous AI applications.
Google's Knowledge Graph is the most visible example. When a user searches for "Albert Einstein," the Knowledge Panel displaying his birth date, notable works, and related people is powered by the Knowledge Graph. This structured representation enables search engines to understand queries at the entity level rather than relying solely on keyword matching.
Knowledge graph question answering (KGQA) systems translate natural language questions into structured queries (typically SPARQL) that can be executed against a knowledge graph. For example, the question "Who founded the company that made the iPhone?" requires traversing two relationships: iPhone -> madeBy -> Apple, and Apple -> foundedBy -> Steve Jobs. Knowledge graphs provide the structured data that makes this multi-hop reasoning possible.
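The two-hop traversal in this example can be sketched as a lookup chain over a toy graph. This is a simplification: real knowledge graphs allow multiple objects per entity-relation pair, and a real KGQA system must first parse the natural language question into this relation chain.

```python
# Toy graph as a mapping: (entity, relation) -> target entity.
graph = {
    ("iPhone", "madeBy"): "Apple",
    ("Apple", "foundedBy"): "Steve Jobs",
}

def traverse(start, relations):
    """Follow a chain of relations from a start entity;
    returns None if any hop is missing from the graph."""
    entity = start
    for relation in relations:
        entity = graph.get((entity, relation))
        if entity is None:
            return None
    return entity

# "Who founded the company that made the iPhone?" compiles to two hops:
print(traverse("iPhone", ["madeBy", "foundedBy"]))  # Steve Jobs
```

A KGQA system would emit the equivalent SPARQL query against the real graph; the hard part is the translation from question to relation chain, not the traversal itself.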
E-commerce platforms and streaming services use knowledge graphs to model relationships between products, users, genres, and attributes. Unlike collaborative filtering alone, knowledge graph-based recommendations can explain why an item was recommended ("Because you liked sci-fi movies directed by Denis Villeneuve") and can address the cold-start problem by leveraging entity attributes even when user interaction data is sparse [9].
Pharmaceutical companies use biomedical knowledge graphs to identify potential drug targets, predict drug interactions, and repurpose existing drugs for new conditions. By modeling relationships between genes, proteins, diseases, pathways, and chemical compounds, researchers can computationally explore hypotheses that would take years to test experimentally. Companies like BenevolentAI and Insilico Medicine have built proprietary knowledge graphs for this purpose.
Financial institutions use knowledge graphs to detect fraud by modeling relationships between accounts, transactions, devices, and individuals. Suspicious patterns, such as circular money transfers, shared devices across seemingly unrelated accounts, or rapid changes in corporate ownership, become visible when represented as a graph.
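Detecting a circular transfer pattern like the one described reduces to cycle detection in a directed graph. Below is a minimal depth-first sketch using hypothetical account names; production systems run such queries at scale inside graph databases.

```python
def find_cycle(transfers, start):
    """Depth-first search over directed transfer edges; returns a path of
    accounts that routes money back to `start`, or None if no cycle exists."""
    stack = [(start, [start])]
    seen = set()
    while stack:
        node, path = stack.pop()
        for nxt in transfers.get(node, []):
            if nxt == start and len(path) > 1:  # ignore trivial self-transfers
                return path + [nxt]
            if nxt not in seen:
                seen.add(nxt)
                stack.append((nxt, path + [nxt]))
    return None

# Hypothetical accounts: A pays B, B pays C, C pays A back.
transfers = {"A": ["B"], "B": ["C"], "C": ["A"]}
print(find_cycle(transfers, "A"))  # ['A', 'B', 'C', 'A']
```

The same pattern is invisible in a flat table of transactions but trivial to surface once the data is modeled as a graph, which is the core argument for graph-based fraud detection.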
GraphRAG is a technique that combines knowledge graphs with retrieval-augmented generation (RAG) to improve the accuracy and reasoning capabilities of large language models. Microsoft Research introduced the approach in an April 2024 paper titled "From Local to Global: A Graph RAG Approach to Query-Focused Summarization" [12].
Traditional RAG systems retrieve flat text chunks from a vector database based on semantic similarity to a query. GraphRAG extends this by constructing a knowledge graph from the source documents, extracting entities and their relationships, and then using graph traversal alongside vector search during retrieval.
The Microsoft implementation follows a pipeline: source documents are split into text chunks; an LLM extracts entities and relationships from each chunk; the extracted elements are merged into a single graph; the Leiden algorithm partitions the graph into hierarchical communities; and the LLM generates a summary for each community. At query time, global questions are answered by aggregating over these community summaries [12].
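A heavily simplified sketch of the indexing flow is shown below. The `llm_extract` stub and sample chunks are hypothetical stand-ins for real LLM extraction calls, and connected components stand in for the Leiden community detection used by the actual implementation.

```python
def llm_extract(chunk):
    """Stub standing in for an LLM call that pulls (subject, predicate,
    object) triples out of a text chunk."""
    facts = {
        "Apple was founded by Steve Jobs.": [("Apple", "foundedBy", "Steve Jobs")],
        "The iPhone is made by Apple.": [("iPhone", "madeBy", "Apple")],
        "Paris is the capital of France.": [("Paris", "capitalOf", "France")],
    }
    return facts.get(chunk, [])

def build_index(chunks):
    """Extract triples per chunk, merge them into one graph, then group
    entities into communities (connected components as a toy stand-in
    for Leiden). Returns the triples and the list of communities."""
    triples = [t for chunk in chunks for t in llm_extract(chunk)]
    adj = {}
    for s, _, o in triples:  # undirected adjacency for community detection
        adj.setdefault(s, set()).add(o)
        adj.setdefault(o, set()).add(s)
    communities, seen = [], set()
    for node in adj:
        if node in seen:
            continue
        component, stack = set(), [node]
        while stack:
            n = stack.pop()
            if n not in component:
                component.add(n)
                stack.extend(adj[n])
        seen |= component
        communities.append(component)
    return triples, communities

chunks = [
    "Apple was founded by Steve Jobs.",
    "The iPhone is made by Apple.",
    "Paris is the capital of France.",
]
triples, communities = build_index(chunks)
print(len(communities))  # 2: the Apple cluster and the Paris cluster
```

In the real system each community would then be summarized by the LLM, and those summaries (rather than raw chunks) are what global queries aggregate over.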
GraphRAG addresses a fundamental limitation of standard RAG: the inability to answer "global" questions that require synthesizing information across an entire corpus. A question like "What are the main themes in this dataset?" cannot be answered by retrieving a handful of similar text chunks. GraphRAG's community summaries and graph structure enable the system to reason across the full breadth of the data.
Microsoft's benchmarks showed that GraphRAG consistently outperformed baseline RAG on comprehensiveness and diversity metrics for global sensemaking questions [12].
A significant limitation of GraphRAG is its computational cost, as the indexing phase requires multiple LLM calls to extract entities and relationships from every document. LightRAG, introduced in October 2024, addresses this by using a dual-level retrieval system that achieves comparable accuracy with approximately 10x token reduction [13]. Other variants, including FastGraphRAG and MiniRAG, have further optimized the approach for production deployment.
By 2025, GraphRAG had moved from research prototype to production deployment. Organizations investing in generative AI increasingly adopted hybrid architectures combining vector search with knowledge graphs. LinkedIn's implementation of knowledge graph-enhanced retrieval reduced ticket resolution time from 40 hours to 15 hours, a 63% improvement [2]. The research community has also expanded rapidly: a July 2025 survey cataloged dozens of GraphRAG publications spanning healthcare, finance, legal, and software engineering domains [14].
As knowledge graphs grow to billions of triples, maintaining query performance becomes difficult. Graph queries that require multi-hop traversal can be computationally expensive, and indexing strategies that work for millions of triples may not scale to billions. Distributed graph databases help but introduce complexity around data partitioning and consistency.
Knowledge graphs are inherently incomplete. No graph captures every fact about every entity, and missing information can lead to incorrect inferences. For example, if a knowledge graph lacks the fact that a certain drug interacts with a particular medication, a healthcare system built on that graph might fail to flag a dangerous combination. Maintaining accuracy as graphs evolve over time requires continuous curation.
Determining when two records refer to the same real-world entity is a persistent challenge. The entity "John Smith" in one data source may or may not be the same person as "J. Smith" in another. Entity resolution at scale requires sophisticated matching algorithms and often human oversight.
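String-similarity scoring is a common first signal for such matching. The sketch below uses Python's standard-library `difflib` and a hypothetical threshold; production entity resolution combines many features (names, addresses, dates, embeddings) with learned models and human review.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Normalized similarity in [0, 1] between two name strings,
    used here as a crude entity-matching signal."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_probable_match(a, b, threshold=0.6):
    """Flag a candidate match when similarity clears a (hypothetical) threshold."""
    return name_similarity(a, b) >= threshold

print(is_probable_match("John Smith", "J. Smith"))  # True
print(is_probable_match("John Smith", "Jane Doe"))  # False
```

Even this toy version surfaces the core difficulty: a high score is evidence, not proof, that two records denote the same person, which is why entity resolution at scale still needs richer features and often human oversight.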
As the domain evolves, the ontology underlying a knowledge graph must evolve too. Adding new entity types, relationships, or constraints without breaking existing queries and applications requires careful schema management. This is particularly challenging for large, multi-team enterprise knowledge graphs.
Facts change over time. Company CEOs change, countries are renamed, and scientific understanding evolves. Keeping a knowledge graph current requires mechanisms for detecting and incorporating changes, which is especially difficult for facts extracted from static document collections.
Knowledge graphs inherit biases from their source data. If the sources overrepresent certain demographics, geographies, or perspectives, the graph will too. Research has documented systematic biases in public knowledge graphs, including gender imbalances and geographic skew toward English-speaking countries [15].
Knowledge graphs are experiencing a resurgence driven by the rise of generative AI. Several trends define the current landscape.
Integration with LLMs. The relationship between knowledge graphs and large language models has become bidirectional. LLMs help build knowledge graphs (through improved entity and relation extraction), and knowledge graphs improve LLMs (by providing structured, factual grounding that reduces hallucination). A dedicated Knowledge Graph Language (KGL-LLM), introduced by Guo et al. in 2025, enables precise integration, reducing completion errors through real-time context retrieval [10].
Enterprise adoption. Organizations are building enterprise knowledge graphs at increasing scale. Hybrid architectures that combine vector search indexes with knowledge graphs have become the recommended approach for production AI systems. This combination provides both semantic similarity search and structured relational reasoning.
Multimodal knowledge graphs. Emerging knowledge graphs incorporate not just text-based facts but also images, audio, and video. Google's use of the Knowledge Graph to ensure consistency across multimodal AI experiences (spanning text, image, video, and voice inputs) illustrates this trend [1].
Open standards and interoperability. The W3C's RDF and SPARQL standards continue to underpin much of the knowledge graph ecosystem, though property graph models (used by databases like Neo4j) have gained significant market share. Efforts to bridge these two paradigms, such as the GQL standard, are underway.
Graph databases market growth. The graph database market, closely tied to knowledge graph adoption, continues to expand. Neo4j, Amazon Neptune, TigerGraph, and ArangoDB are among the leading platforms, with newer entrants like PuppyGraph offering virtualized graph layers over existing data warehouses [9].