Graph
Last reviewed
May 11, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 2,390 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 11, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 2,390 words
Add missing citations, update stale details, or suggest a clearer explanation.
See also: Machine learning terms
In machine learning the word graph carries two distinct meanings. The first is the computation graph: a directed acyclic record of the operations that a deep learning framework performs on tensors, which makes automatic differentiation and backpropagation practical at scale. The second is the graph as a data structure: a set of nodes connected by edges, used to represent relational data such as social networks, molecules, and citation networks, and processed by graph neural networks (GNNs). The two senses share almost nothing beyond the name. A computation graph is an internal bookkeeping object inside a framework like TensorFlow, PyTorch, or JAX; a graph in the data-structure sense is the input to a model.
A graph is formally defined as a pair G = (V, E), where V is a set of nodes (vertices) and E is a set of edges (links) connecting pairs of nodes. This definition dates to Euler's 1736 paper on the Königsberg bridges and carries over almost untouched into machine learning. What machine learning adds is attributes: each node, each edge, and sometimes the graph as a whole can carry a feature vector. A protein-protein interaction graph, for instance, might attach an amino acid sequence to every node and a binding affinity score to every edge.
In a directed graph the edges have a direction, indicating an asymmetric relationship between the connected nodes. The follower-followee relation on Twitter is the canonical example. In an undirected graph edges have no direction; a Facebook friendship is the textbook case. Many models, including the original graph convolutional network, assume the graph is undirected, though directed and heterogeneous variants exist.
In a weighted graph every edge carries a scalar weight representing the strength, distance, or significance of the connection. Road networks use weights for travel time; recommendation graphs use them for interaction counts. In an unweighted graph every edge is treated as equal. Many real datasets sit between these extremes: edges either exist or do not, but each carries a richer feature vector instead of a single scalar.
A graph is homogeneous when all nodes are the same type and all edges represent the same relation, and heterogeneous otherwise. A knowledge graph such as Wikidata is heterogeneous. A graph is static if its structure does not change over time, and dynamic if nodes or edges appear and disappear, as in trading networks or temporal interaction logs.
Graph neural networks are a family of models that consume graph-structured data directly, without flattening it into a fixed-size vector first. The unifying idea is message passing, formalized as Message Passing Neural Networks (MPNN) by Justin Gilmer and colleagues at Google Brain in their 2017 paper Neural Message Passing for Quantum Chemistry. In an MPNN each node sends a message to its neighbors, collects the messages it receives, then updates its hidden state using an aggregation function followed by a learned update function. Stacking k such layers lets information flow across paths of length k in the graph.
A core property of message passing is permutation equivariance: relabeling the nodes does not change the output, because aggregation uses a permutation-invariant operator such as sum, mean, or max. That property is what makes GNNs sensible models for graphs, which have no canonical node order.
The graph convolutional network was introduced by Thomas Kipf and Max Welling in their 2017 ICLR paper Semi-Supervised Classification with Graph Convolutional Networks. The GCN layer can be derived as a first-order approximation of a localized spectral filter on the graph Laplacian, but its real appeal is how simple the update rule is. Each node averages the features of its neighbors (plus itself), multiplies by a learned weight matrix, and applies a nonlinearity. The normalization uses the symmetrically normalized adjacency matrix, which keeps eigenvalues bounded. GCNs work well on citation networks, where two layers are often enough.
Graph attention networks, introduced by Petar Veličković and collaborators in their 2018 ICLR paper, replace the fixed normalization weights in a GCN with learned attention coefficients. For every directed edge from node u to node v the model computes a score, normalizes the scores across v's neighbors via a softmax, and uses those scores to weight the messages. GATs typically use multi-head attention, like the Transformer. A GCN can be viewed as a degenerate GAT with attention scores fixed by the normalized adjacency matrix.
GraphSAGE (Hamilton, Ying, and Leskovec, NeurIPS 2017) was the first widely adopted inductive GNN: instead of memorizing embeddings for a fixed set of nodes, it learns aggregator functions that generalize to unseen nodes. That made GNNs practical for very large graphs such as Pinterest's PinSage system. Other variants include the Gated Graph Sequence Neural Network (GGS-NN, 2015) and the Relational GCN (R-GCN) for heterogeneous knowledge graphs.
Before GNNs took over, shallow embedding methods dominated the space. DeepWalk (Perozzi et al., KDD 2014) generates random walks on the graph and feeds them into the skip-gram objective from word2vec, treating each walk as a sentence and each node as a word. Node2Vec (Grover and Leskovec, KDD 2016) generalizes DeepWalk with two parameters p and q that bias the walks toward either breadth-first or depth-first exploration. These methods are still useful when node features are unavailable, and they sometimes initialize GNNs.
A theoretical result that shaped GNN research is that standard message passing GNNs are at most as powerful as the 1-Weisfeiler-Leman graph isomorphism test, proven by Xu et al. (Graph Isomorphism Network, ICLR 2019) and Morris et al. (AAAI 2019). Pairs of non-isomorphic graphs exist that no standard GNN can distinguish. Higher-order GNNs and subgraph-based models have been proposed to push past this ceiling.
GNN-based methods have produced concrete wins across several domains. AlphaFold 2 (DeepMind, 2020) uses graph-based attention modules over residue pairs in its protein structure prediction pipeline, the system that broke open the CASP14 benchmark. In drug discovery, MPNNs predict molecular properties from atom-and-bond graphs. In recommendation, PinSage at Pinterest uses GNNs over user-item bipartite graphs. Sanchez-Gonzalez et al. (DeepMind, 2020) showed that learned GNN simulators can match classical solvers on fluid and rigid-body dynamics. Graphs also appear in fraud detection, anomaly detection on system provenance logs, and combinatorial optimization heuristics.
The main software libraries are PyTorch Geometric (PyG), built on PyTorch, and Deep Graph Library (DGL), developed by AWS Shanghai and NYU Shanghai, which supports both PyTorch and TensorFlow. Both provide message passing primitives, benchmark datasets, and reference implementations of GCN, GAT, GraphSAGE, and dozens of other architectures.
A computation graph (also called a computational graph or dataflow graph) is a directed acyclic graph whose nodes are operations and whose edges carry tensors between them. Frameworks use this representation for two reasons. First, it makes automatic differentiation mechanical: once the forward graph is known, the reverse pass that computes gradients is another walk over the same structure, applying the chain rule at each node. Second, the graph is a compilation target; an optimizing compiler can fuse kernels, fold constants, and dispatch operations to GPUs or TPUs with much less Python overhead. Reverse-mode autodiff, what most frameworks implement, was published by Seppo Linnainmaa in 1976; backpropagation is the special case where the differentiated function is a neural network loss.
The first wave of deep learning frameworks used static graphs, sometimes called define-and-run. In TensorFlow 1.x the programmer first builds the entire graph in Python using placeholders for inputs and variables for trainable parameters, then opens a tf.Session and calls sess.run to feed concrete values through the graph. The graph is frozen once constructed; changing the model means rebuilding it.
Static graphs give the runtime a complete picture of the program ahead of time, which enables global optimizations: constant folding, common subexpression elimination, operator fusion, and memory planning. They also serialize cleanly, which is why TensorFlow's SavedModel format has been a durable deployment target. The cost is that the Python code that defines the graph is not the code that runs, which makes debugging painful and data-dependent control flow awkward. Theano, which arguably started the modern era of deep learning around 2008, used a similar model.
PyTorch, released by Facebook AI Research in 2017, popularized the opposite philosophy, called define-by-run or eager execution. The graph is not declared in advance; it is built implicitly as you run Python code. Every operation on a tensor that has requires_grad=True creates a node in the graph and records the function needed to compute its gradient. When you call .backward() on a scalar loss, PyTorch walks that recorded graph in reverse, accumulates gradients into the .grad attributes of the leaves, and then discards the graph.
The leaves are the input tensors that require gradients (typically the model parameters). The roots are the tensors on which you call .backward(). Each non-leaf tensor carries a grad_fn attribute that points to the operation that produced it, and these pointers form the DAG that autograd traverses. Because the graph is rebuilt every iteration, the model can branch, loop, or use Python conditionals that depend on tensor values, and the differentiation still works. Recurrent networks with variable-length inputs and reinforcement learning policies benefit from this flexibility.
The tradeoff is that the runtime sees only one operator at a time, which limits compiler optimizations. Chainer, released by Preferred Networks in 2015, was the first widely used define-by-run framework and directly inspired PyTorch.
TensorFlow 2.0 (2019) made eager execution the default and exposed the graph machinery through the tf.function decorator. Wrapping a Python function in @tf.function triggers tracing: TensorFlow runs the function once with symbolic placeholders, records the operations, and produces a ConcreteFunction backed by a tf.Graph. Subsequent calls with the same input signature reuse the cached graph; calls with new shapes or dtypes trigger a fresh trace.
Tracing uses AutoGraph, a source-level transformer that rewrites Python control flow into graph operations. A Python if on a tensor becomes a tf.cond; a tensor-conditioned while becomes a tf.while_loop. This lets users write ordinary-looking Python and still get a static graph. Setting jit_compile=True hands the result to XLA (Accelerated Linear Algebra), which fuses operations and often produces significant speedups on TPUs and GPUs.
JAX, released by Google in 2018, exposes graph construction and compilation as user-facing function transformations. The core transformations are:
| Transformation | Purpose |
|---|---|
jax.grad | Returns a function that computes the gradient via reverse-mode autodiff. |
jax.jit | Traces the function into JAX's IR (jaxpr) and compiles it with XLA. |
jax.vmap | Vectorizes the function over a batch axis without writing an explicit loop. |
jax.pmap | Parallelizes the function across multiple devices in a single-program-multiple-data style. |
These transformations compose. A common pattern is jax.pmap(jax.jit(jax.grad(loss_fn))), a per-device JIT-compiled gradient function parallelized across a TPU pod. JAX's tracing model requires the traced function to be a pure function of its inputs, a real constraint, but in exchange the user gets predictable graph capture and strong compiler optimizations.
PyTorch 2.0, released in March 2023, closed much of the gap between eager and graph execution with torch.compile. The implementation uses TorchDynamo, a JIT compiler that hooks into the CPython frame evaluation API (PEP 523) and rewrites Python bytecode at runtime to extract sequences of PyTorch operations into an FX graph. AOTAutograd captures the backward pass ahead of time, PrimTorch decomposes operations into a smaller primitive set, and TorchInductor lowers the result to fast device kernels, often via Triton on GPUs. The PyTorch team reported that TorchDynamo captures graphs successfully on about 99% of a large benchmark suite, falling back to eager mode on code it cannot translate. The engineering direction is clear: dynamic frontends with graph-mode backends.
There are two ways to apply the chain rule over a computation graph. Forward mode propagates derivatives alongside values, from inputs to outputs, and costs n passes for n inputs. Reverse mode does one forward pass to record values, then walks backward through the graph applying the chain rule to one output at a time, and costs m passes for m outputs. Deep learning losses are scalar (m = 1) with millions of parameters (n very large), so reverse mode is dramatically cheaper. That asymmetry is the entire reason every major deep learning framework defaults to reverse-mode autodiff, which is the formal name for backpropagation.
The computation graph inside PyTorch and the graph that you feed to a GNN are unrelated objects despite the shared vocabulary of nodes and edges. The computation graph is always a DAG because it records a deterministic forward pass; an input graph in the data-structure sense may be cyclic, undirected, or both. A graph neural network has a computation graph too since it runs inside a framework, but that internal graph is distinct from the data-structure graph it consumes.