Graph
Last reviewed
Sources
20 citations
Review status
Source-backed
Revision
v3 · 3,219 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Sources
20 citations
Review status
Source-backed
Revision
v3 · 3,219 words
Add missing citations, update stale details, or suggest a clearer explanation.
See also: Machine learning terms
In machine learning the word graph has two unrelated meanings. The first is a graph as a data structure: a set of nodes (vertices) connected by edges, written formally as G = (V, E), used to represent relational data such as social networks, molecules, knowledge bases, and citation networks, and processed by graph neural networks (GNNs) [1][7]. The second is a computation graph: a directed acyclic record of the operations a deep learning framework performs on tensors, which makes automatic differentiation and backpropagation practical at scale [2][3]. The two senses share almost nothing beyond the vocabulary of "nodes" and "edges." A computation graph is an internal bookkeeping object inside a framework like TensorFlow, PyTorch, or JAX; a graph in the data-structure sense is the input to a model [5].
In machine learning the word graph carries two distinct meanings. The first is the graph as a data structure: a set of nodes connected by edges, used to represent relational data such as social networks, molecules, and citation networks, and processed by graph neural networks (GNNs). The second is the computation graph: a directed acyclic record of the operations that a deep learning framework performs on tensors, which makes automatic differentiation and backpropagation practical at scale. The two senses share almost nothing beyond the name. A computation graph is an internal bookkeeping object inside a framework like TensorFlow, PyTorch, or JAX; a graph in the data-structure sense is the input to a model.
A graph is formally defined as a pair G = (V, E), where V is a set of nodes (vertices) and E is a set of edges (links) connecting pairs of nodes. This definition dates to Leonhard Euler's 1736 paper on the Seven Bridges of Königsberg, the result widely regarded as the first theorem of graph theory: Euler reduced the city to a graph of 4 vertices (the two riverbanks and two islands) and 7 edges (the bridges) and proved that no walk could cross every bridge exactly once [8]. The definition carries over almost untouched into machine learning. What machine learning adds is attributes: each node, each edge, and sometimes the graph as a whole can carry a feature vector. A protein-protein interaction graph, for instance, might attach an amino acid sequence to every node and a binding affinity score to every edge.
In a directed graph the edges have a direction, indicating an asymmetric relationship between the connected nodes. The follower-followee relation on a platform such as X (formerly Twitter) is the canonical example. In an undirected graph edges have no direction; a Facebook friendship is the textbook case. Many models, including the original graph convolutional network, assume the graph is undirected, though directed and heterogeneous variants exist [9].
In a weighted graph every edge carries a scalar weight representing the strength, distance, or significance of the connection. Road networks use weights for travel time; recommendation graphs use them for interaction counts. In an unweighted graph every edge is treated as equal. Many real datasets sit between these extremes: edges either exist or do not, but each carries a richer feature vector instead of a single scalar.
A graph is homogeneous when all nodes are the same type and all edges represent the same relation, and heterogeneous otherwise. A knowledge graph such as Wikidata is heterogeneous. A graph is static if its structure does not change over time, and dynamic if nodes or edges appear and disappear, as in trading networks or temporal interaction logs. The table below summarizes the main axes.
| Axis | Type A | Type B | Example |
|---|---|---|---|
| Direction | Directed | Undirected | Follower graph vs. friendship graph |
| Edge weight | Weighted | Unweighted | Road network vs. plain link graph |
| Node/edge types | Heterogeneous | Homogeneous | Knowledge graph vs. citation network |
| Time | Dynamic | Static | Trading network vs. fixed molecule |
Graph neural networks are a family of models that consume graph-structured data directly, without flattening it into a fixed-size vector first. The Distill article A Gentle Introduction to Graph Neural Networks defines one concisely: "A GNN is an optimizable transformation on all attributes of the graph (nodes, edges, global-context) that preserves graph symmetries (permutation invariances)." [15] The unifying idea is message passing, formalized as Message Passing Neural Networks (MPNN) by Justin Gilmer and colleagues at Google Brain in their 2017 paper Neural Message Passing for Quantum Chemistry [10]. In an MPNN each node sends a message to its neighbors, collects the messages it receives, then updates its hidden state using an aggregation function followed by a learned update function. Stacking k such layers lets information flow across paths of length k in the graph.
A core property of message passing is permutation equivariance: relabeling the nodes does not change the output, because aggregation uses a permutation-invariant operator such as sum, mean, or max [15]. That property is what makes GNNs sensible models for graphs, which have no canonical node order.
The graph convolutional network was introduced by Thomas Kipf and Max Welling in their 2017 ICLR paper Semi-Supervised Classification with Graph Convolutional Networks [9]. The authors motivate the architecture "via a localized first-order approximation of spectral graph convolutions," and report a model that "scales linearly in the number of graph edges" [9]. Its real appeal is how simple the update rule is. Each node averages the features of its neighbors (plus itself), multiplies by a learned weight matrix, and applies a nonlinearity. The normalization uses the symmetrically normalized adjacency matrix, which keeps eigenvalues bounded. GCNs work well on citation networks, where two layers are often enough.
Graph attention networks, introduced by Petar Veličković and collaborators in their 2018 ICLR paper, replace the fixed normalization weights in a GCN with learned attention coefficients [11]. The paper presents "novel neural network architectures that operate on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations." [11] For every directed edge from node u to node v the model computes a score, normalizes the scores across v's neighbors via a softmax, and uses those scores to weight the messages. GATs typically use multi-head attention, like the Transformer and its underlying attention mechanism. A GCN can be viewed as a degenerate GAT with attention scores fixed by the normalized adjacency matrix.
GraphSAGE (Hamilton, Ying, and Leskovec, NeurIPS 2017) was the first widely adopted inductive GNN: instead of memorizing embeddings for a fixed set of nodes, it learns aggregator functions that generalize to unseen nodes [12]. The name abbreviates "SAmple and aggreGatE." That made GNNs practical for very large graphs such as Pinterest's PinSage system, which learns embeddings for nodes numbering in the billions [12][17]. Other variants include the Gated Graph Sequence Neural Network (GGS-NN, 2015) and the Relational GCN (R-GCN) for heterogeneous knowledge graphs.
Before GNNs took over, shallow embedding methods dominated the space. DeepWalk (Perozzi et al., KDD 2014) generates random walks on the graph and feeds them into the skip-gram objective from word2vec, treating each walk as a sentence and each node as a word [13]. Node2Vec (Grover and Leskovec, KDD 2016) generalizes DeepWalk with two parameters p and q that bias the walks toward either breadth-first or depth-first exploration [14]. These methods are still useful when node features are unavailable, and they sometimes initialize GNNs.
A theoretical result that shaped GNN research is that standard message passing GNNs are at most as powerful as the 1-Weisfeiler-Leman (1-WL) graph isomorphism test, proven by Xu et al. (Graph Isomorphism Network, ICLR 2019) and Morris et al. (AAAI 2019) [7]. The Xu et al. paper develops "a simple architecture that is provably the most expressive among the class of GNNs and is as powerful as the Weisfeiler-Lehman graph isomorphism test" [7]. Pairs of non-isomorphic graphs exist that no standard GNN can distinguish. Higher-order GNNs and subgraph-based models have been proposed to push past this ceiling.
GNN-based methods have produced concrete wins across several domains. AlphaFold 2 (DeepMind, 2020) uses graph-based attention modules (its Evoformer stack) over residue pairs in its protein structure prediction pipeline; at CASP14 it reached a median GDT score of 92.4, a level of accuracy comparable to experimental methods and the system that broke open the benchmark [16]. In drug discovery, MPNNs predict molecular properties from atom-and-bond graphs [10]. In recommendation, PinSage at Pinterest uses GNNs over user-item bipartite graphs to build a web-scale recommendation system [17]. Sanchez-Gonzalez et al. (DeepMind, 2020) showed that learned GNN simulators can match classical solvers on fluid and rigid-body dynamics. Graphs also appear in fraud detection, anomaly detection on system provenance logs, and combinatorial optimization heuristics. The same node-and-edge structure underlies the knowledge graph, where heterogeneous nodes and typed edges encode facts for search and question answering.
The main software libraries are PyTorch Geometric (PyG), built on PyTorch, and Deep Graph Library (DGL), developed by AWS Shanghai and NYU Shanghai, which supports both PyTorch and TensorFlow [18][19]. Both provide message passing primitives, benchmark datasets, and reference implementations of GCN, GAT, GraphSAGE, and dozens of other architectures.
A computation graph (also called a computational graph or dataflow graph) is a directed acyclic graph whose nodes are operations and whose edges carry tensors between them [1][2]. Frameworks use this representation for two reasons. First, it makes automatic differentiation mechanical: once the forward graph is known, the reverse pass that computes gradients is another walk over the same structure, applying the chain rule at each node [6]. Second, the graph is a compilation target; an optimizing compiler can fuse kernels, fold constants, and dispatch operations to GPUs or TPUs with much less Python overhead [1]. Reverse-mode autodiff, what most frameworks implement, was published by Seppo Linnainmaa in 1976; backpropagation is the special case where the differentiated function is a neural network loss [6].
The first wave of deep learning frameworks used static graphs, sometimes called define-and-run. In TensorFlow 1.x the programmer first builds the entire graph in Python using placeholders for inputs and variables for trainable parameters, then opens a tf.Session and calls sess.run to feed concrete values through the graph. The graph is frozen once constructed; changing the model means rebuilding it.
Static graphs give the runtime a complete picture of the program ahead of time, which enables global optimizations: constant folding, common subexpression elimination, operator fusion, and memory planning [1]. They also serialize cleanly, which is why TensorFlow's SavedModel format has been a durable deployment target. The cost is that the Python code that defines the graph is not the code that runs, which makes debugging painful and data-dependent control flow awkward. Theano, which arguably started the modern era of deep learning around 2008, used a similar model.
PyTorch, released by Facebook AI Research in 2017, popularized the opposite philosophy, called define-by-run or eager execution. The graph is not declared in advance; it is built implicitly as you run Python code [2]. Every operation on a tensor that has requires_grad=True creates a node in the graph and records the function needed to compute its gradient. When you call .backward() on a scalar loss, PyTorch walks that recorded graph in reverse, accumulates gradients into the .grad attributes of the leaves, and then discards the graph [2].
The leaves are the input tensors that require gradients (typically the model parameters). The roots are the tensors on which you call .backward(). Each non-leaf tensor carries a grad_fn attribute that points to the operation that produced it, and these pointers form the DAG that autograd traverses [2]. Because the graph is rebuilt every iteration, the model can branch, loop, or use Python conditionals that depend on tensor values, and the differentiation still works. Recurrent networks with variable-length inputs and reinforcement learning policies benefit from this flexibility.
The tradeoff is that the runtime sees only one operator at a time, which limits compiler optimizations. Chainer, released by Preferred Networks in 2015, was the first widely used define-by-run framework and directly inspired PyTorch.
TensorFlow 2.0 (2019) made eager execution the default and exposed the graph machinery through the tf.function decorator. Wrapping a Python function in @tf.function triggers tracing: TensorFlow runs the function once with symbolic placeholders, records the operations, and produces a ConcreteFunction backed by a tf.Graph. Subsequent calls with the same input signature reuse the cached graph; calls with new shapes or dtypes trigger a fresh trace [1].
Tracing uses AutoGraph, a source-level transformer that rewrites Python control flow into graph operations. A Python if on a tensor becomes a tf.cond; a tensor-conditioned while becomes a tf.while_loop. This lets users write ordinary-looking Python and still get a static graph [1]. Setting jit_compile=True hands the result to XLA (Accelerated Linear Algebra), the XLA compiler that fuses operations and often produces significant speedups on TPUs and GPUs.
JAX, released by Google in 2018, exposes graph construction and compilation as user-facing function transformations [5]. The core transformations are:
| Transformation | Purpose |
|---|---|
jax.grad | Returns a function that computes the gradient via reverse-mode autodiff. |
jax.jit | Traces the function into JAX's IR (jaxpr) and compiles it with XLA. |
jax.vmap | Vectorizes the function over a batch axis without writing an explicit loop. |
jax.pmap | Parallelizes the function across multiple devices in a single-program-multiple-data style. |
These transformations compose. A common pattern is jax.pmap(jax.jit(jax.grad(loss_fn))), a per-device JIT-compiled gradient function parallelized across a TPU pod [5]. JAX's tracing model requires the traced function to be a pure function of its inputs: as the documentation states, "all the input data is passed through the function parameters, all the results are output through the function results" [20]. That is a real constraint, but in exchange the user gets predictable graph capture and strong compiler optimizations.
PyTorch 2.0, released in early March 2023, closed much of the gap between eager and graph execution with torch.compile [3]. The implementation uses TorchDynamo, a JIT compiler that hooks into the CPython frame evaluation API (PEP 523) and rewrites Python bytecode at runtime to extract sequences of PyTorch operations into an FX graph [3][4]. AOTAutograd captures the backward pass ahead of time, PrimTorch decomposes operations into a smaller primitive set, and TorchInductor lowers the result to fast device kernels, often via Triton on GPUs [3]. The PyTorch team validated TorchDynamo on a set of 7,000+ GitHub projects written in PyTorch and reported: "While TorchScript and others struggled to even acquire the graph 50% of the time, often with a big overhead, TorchDynamo acquired the graph 99% of the time, correctly, safely and with negligible overhead." [3] It falls back to eager mode on code it cannot translate. The engineering direction is clear: dynamic frontends with graph-mode backends.
There are two ways to apply the chain rule over a computation graph. Forward mode propagates derivatives alongside values, from inputs to outputs, and costs n passes for n inputs. Reverse mode does one forward pass to record values, then walks backward through the graph applying the chain rule to one output at a time, and costs m passes for m outputs [6]. Deep learning losses are scalar (m = 1) with millions of parameters (n very large), so reverse mode is dramatically cheaper. That asymmetry is the entire reason every major deep learning framework defaults to reverse-mode autodiff, which is the formal name for backpropagation [6].
The computation graph inside PyTorch and the graph that you feed to a GNN are unrelated objects despite the shared vocabulary of nodes and edges. The computation graph is always a DAG because it records a deterministic forward pass; an input graph in the data-structure sense may be cyclic, undirected, or both [1][7]. A graph neural network has a computation graph too since it runs inside a framework, but that internal graph is distinct from the data-structure graph it consumes.
Think of a graph as dots joined by lines. The dots are "nodes" and the lines are "edges." People in a friend network are dots, friendships are lines; atoms in a molecule are dots, chemical bonds are lines. A graph neural network is a program that learns by letting each dot "whisper" what it knows to the dots it is connected to, over and over, until every dot understands its neighborhood. The other kind of graph, a computation graph, is just a recipe a computer keeps of every math step it took, so it can rewind those steps to figure out how to improve. Same word, two very different ideas.