Graph Machine Learning Models
Last reviewed
May 13, 2026
Sources
56 citations
Review status
Source-backed
Revision
v2 · 5,991 words
See also: Multimodal Models and Tasks
Graph machine learning models are neural networks designed to operate on data structured as graphs, where the input is a set of nodes connected by edges rather than a grid like an image or a sequence like text. The defining property of these models is permutation equivariance: the output for a node should not change if the graph is relabeled, which rules out treating a graph as a flat vector and motivates the message passing computation used by most modern graph neural networks (GNNs). Since the introduction of the Graph Convolutional Network (GCN) by Thomas Kipf and Max Welling in 2017, the field has expanded into hundreds of architectures, supports several large open-source libraries, and powers production systems at Google, Pinterest, Amazon, DeepMind, and Microsoft.
Unlike convolutional or recurrent networks, GNNs do not assume a fixed neighborhood or a fixed input length. Each node updates its representation by aggregating signals from its neighbors and then applying a learnable transformation. The same parameters are shared across every node and every edge, which makes the model size independent of graph size and lets a model trained on small graphs generalize to larger ones. The graphs themselves can be homogeneous (one node type, one edge type), heterogeneous (multiple types, as in knowledge graphs), directed or undirected, weighted or unweighted, and static or evolving in time.
Graph machine learning addresses five canonical tasks. The choice of task determines the loss function, the readout, and the evaluation metric.
| Task | Goal | Example | Typical readout |
|---|---|---|---|
| Node classification | Predict a label for each node | Citation network paper category | Per-node softmax |
| Link prediction | Predict whether two nodes are connected | Recommending a friend on a social network | Dot product or bilinear score |
| Graph classification | Predict a label for the whole graph | Molecule toxicity | Sum or mean pooling of node embeddings |
| Graph regression | Predict a continuous value for a graph | Molecular property such as the HOMO-LUMO gap | Pooled embedding plus MLP |
| Graph generation | Sample new graphs from a distribution | De novo drug design | Autoregressive or diffusion decoder |
Several other settings build on these primitives. Community detection partitions nodes into clusters using embeddings from an unsupervised GNN. Subgraph matching identifies whether a small motif occurs inside a larger graph. Combinatorial optimization (traveling salesman, maximum independent set) has been attacked with GNN policies trained via reinforcement learning.
Inputs can carry features on nodes, edges, or both. A molecule has atom features (element, charge) on nodes and bond features (single, double, aromatic) on edges. A road network has road type and length on edges and intersection coordinates on nodes. Modern GNN libraries treat these features uniformly through a small set of message and update functions.
The idea of computing on graphs with neural networks predates the deep learning era. Marco Gori and Franco Scarselli proposed the original "graph neural network" in 2005 and 2009 as a recurrent fixed point system. It was hard to train and never became widely used. The current wave of GNNs grew from two lines of work that converged around 2016 and 2017.
The first line was shallow node embedding, which treats each node as a token and learns a low-dimensional vector for it without parametric message passing. DeepWalk by Bryan Perozzi, Rami Al-Rfou, and Steven Skiena, presented at KDD 2014, generated random walks on a graph and ran the word2vec skip-gram model on the resulting sequences. LINE by Jian Tang and collaborators at WWW 2015 preserved first-order and second-order proximity through an explicit objective. node2vec, by Aditya Grover and Jure Leskovec at KDD 2016, extended DeepWalk with biased random walks controlled by two parameters that interpolate between breadth-first and depth-first exploration. These models scaled to millions of nodes but produced a fixed embedding table that could not handle new nodes or edge features.
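The biased walk used by node2vec can be sketched in a few lines. This is a minimal illustration rather than the reference implementation: the function name `node2vec_walk`, the adjacency-dict format, and the toy graph are assumptions made for the example. The two parameters are the return parameter `p` and the in-out parameter `q`; each candidate next node is weighted by `1/p` if it is the previous node, `1` if it is also a neighbor of the previous node, and `1/q` otherwise.

```python
import random


def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """Sample one second-order biased walk. adj maps node -> set of neighbors."""
    rng = rng or random.Random(0)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])
        if not nbrs:
            break  # dead end: stop early
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform.
            walk.append(rng.choice(nbrs))
            continue
        prev = walk[-2]
        weights = []
        for x in nbrs:
            if x == prev:
                weights.append(1.0 / p)   # step back to the previous node
            elif x in adj[prev]:
                weights.append(1.0)       # stay close (breadth-first flavor)
            else:
                weights.append(1.0 / q)   # move outward (depth-first flavor)
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk


# Hypothetical toy graph: a triangle 0-1-2 with a tail 1-3-4.
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1, 4}, 4: {3}}
walk = node2vec_walk(adj, start=0, length=8, p=0.5, q=2.0, rng=random.Random(7))
```

With `q > 1` the walk tends to stay near the start node, while `q < 1` pushes it outward. The resulting walks would then be fed to a skip-gram model, which is not shown here.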
The second line was spectral graph convolution. Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun proposed in 2014 to define convolution on a graph through the eigendecomposition of the graph Laplacian. The cost was cubic in the number of nodes, and filters were not localized. Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst introduced ChebNet at NeurIPS 2016, which approximated spectral filters with Chebyshev polynomials of the Laplacian and reduced cost to linear in the number of edges. ChebNet was the immediate precursor to GCN.
Graph Convolutional Network (GCN), introduced by Kipf and Welling at ICLR 2017, simplified ChebNet to a single-hop filter with a renormalization trick and showed strong results on the Cora, CiteSeer, and PubMed citation benchmarks. Within months the field exploded. GraphSAGE by Will Hamilton, Rex Ying, and Jure Leskovec at NeurIPS 2017 introduced inductive learning through neighbor sampling. Graph Attention Network (GAT) by Petar Veličković and collaborators at ICLR 2018 added learnable attention weights. Justin Gilmer and colleagues at ICML 2017 unified existing models under the Message Passing Neural Network (MPNN) framework.
Most convolutional GNNs follow the same two step recipe. For each node, aggregate a function of the neighbor features and the connecting edge features, then update the node feature using the aggregated message and the previous state. Models differ in the aggregator (sum, mean, max, attention) and the update (linear, MLP, gated recurrent unit). The table below lists the most cited architectures.
| Architecture | Year | Authors | Aggregator | Notable property |
|---|---|---|---|---|
| GCN | 2017 | Kipf, Welling | Normalized sum with degree | Simplest spectral approximation, transductive |
| GraphSAGE | 2017 | Hamilton, Ying, Leskovec | Mean, max, or LSTM over sampled neighbors | First inductive GNN at scale |
| GAT | 2018 | Veličković et al. | Multi-head attention | Learnable neighbor weighting |
| MPNN | 2017 | Gilmer et al. | General message function plus update | Unifying framework for chemistry GNNs |
| GIN | 2019 | Xu, Hu, Leskovec, Jegelka | Sum followed by MLP | Provably as expressive as the 1-Weisfeiler-Lehman test |
| R-GCN | 2017 | Schlichtkrull et al. | Per-relation linear sum | First strong GNN for knowledge graphs |
| HAN | 2019 | Wang et al. | Meta-path attention | Heterogeneous node and edge types |
| Cluster-GCN | 2019 | Chiang et al. | Mini-batch within graph partitions | Scaling GCN to 100M edges |
| GraphSAINT | 2020 | Zeng et al. | Subgraph sampling with normalization | Unbiased mini-batches for large graphs |
| SIGN | 2020 | Frasca et al. | Precomputed multi-hop diffusion | Training reduces to an MLP over precomputed features, very fast |
| PNA | 2020 | Corso et al. | Multiple aggregators with degree scalers | Strong on regular graphs |
| DiffPool | 2018 | Ying et al. | Learnable hierarchical pooling | First differentiable graph pooling |
GCN. The forward pass for a single layer computes a normalized adjacency multiplication: each node takes a weighted sum of the features of its neighbors and itself, with each term scaled by the inverse square root of the product of the two nodes' degrees. A learnable linear map and a nonlinearity follow. Two GCN layers cover the two-hop neighborhood of each node and reached state-of-the-art accuracy on small citation graphs when the model was introduced. GCN is transductive in its original formulation: the renormalized adjacency matrix is computed on the whole graph at training time, so adding new nodes later requires recomputing the matrix.
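As a rough sketch of the layer just described, the following pure-Python function computes ReLU(D^-1/2 (A + I) D^-1/2 X W) on dense lists. The name `gcn_layer`, the dense-list representation, and the three-node path graph are assumptions chosen for readability; real implementations use sparse tensors.

```python
import math


def gcn_layer(adj, feats, weight):
    """One GCN layer on dense Python lists: normalize, aggregate, transform, ReLU."""
    n = len(adj)
    # Add self-loops so each node also sees its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    in_dim, out_dim = len(weight), len(weight[0])
    out = []
    for i in range(n):
        agg = [0.0] * in_dim
        for j in range(n):
            if a_hat[i][j]:
                # Symmetric normalization: 1 / sqrt(deg_i * deg_j).
                coeff = a_hat[i][j] / math.sqrt(deg[i] * deg[j])
                for k in range(in_dim):
                    agg[k] += coeff * feats[j][k]
        # Learnable linear map followed by ReLU.
        out.append([
            max(0.0, sum(agg[k] * weight[k][m] for k in range(in_dim)))
            for m in range(out_dim)
        ])
    return out


# Path graph 0 - 1 - 2 with a one-hot signal on node 0 and an identity weight.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
feats = [[1.0], [0.0], [0.0]]
hidden = gcn_layer(adj, feats, weight=[[1.0]])
```

After one layer the signal has reached node 1 but not node 2, which illustrates why two layers are needed to cover the two-hop neighborhood.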
GraphSAGE. GraphSAGE addresses the inductive problem by sampling a fixed number of neighbors for each node and applying an aggregator over the sample: mean, max-pooling, or an LSTM. The mean and pooling aggregators are permutation invariant; the LSTM is not, so it is applied to a random ordering of the sampled neighbors. Because no full adjacency multiplication is required, GraphSAGE can produce embeddings for nodes that did not exist at training time, which made it practical for production recommender systems.
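The sampling-plus-aggregation step can be illustrated with a small sketch. The helper name `sage_mean`, the dict-based graph, and the choice to return the concatenation of the node's own features with the neighbor mean (leaving out the learned linear map and nonlinearity) are simplifying assumptions for this example.

```python
import random


def sage_mean(adj, feats, node, num_samples, rng):
    """Concatenate a node's features with the mean over a sampled neighborhood."""
    nbrs = sorted(adj[node])
    # Cap the neighborhood size; use all neighbors when there are few of them.
    sampled = rng.sample(nbrs, num_samples) if len(nbrs) > num_samples else nbrs
    dim = len(feats[node])
    if sampled:
        mean = [sum(feats[u][k] for u in sampled) / len(sampled) for k in range(dim)]
    else:
        mean = [0.0] * dim
    return list(feats[node]) + mean


adj = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
feats = {"a": [1.0, 0.0], "b": [0.0, 2.0], "c": [0.0, 4.0]}
rng = random.Random(0)
emb_a = sage_mean(adj, feats, "a", num_samples=2, rng=rng)

# A node that appears later needs no retraining: the same function applies.
adj["d"] = {"a"}
adj["a"].add("d")
feats["d"] = [5.0, 5.0]
emb_d = sage_mean(adj, feats, "d", num_samples=2, rng=rng)
```

The second call shows the inductive property: node `d` was added after the first embedding was computed, yet it is embedded by the same parameter-free routine using only its local neighborhood.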
GAT. GAT replaces the fixed adjacency weights with learned attention coefficients. For each edge, a shared linear map produces scores that are normalized over the neighborhood by a softmax. The model uses multiple attention heads in parallel, similar to the Transformer block, and concatenates or averages their outputs. The attention weights are interpretable in some cases and make the model robust to noisy edges.
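A single attention head can be sketched as follows, assuming the scoring form from the GAT paper: a shared linear map `W`, an attention vector `a` applied to the concatenated transformed features, a LeakyReLU, and a softmax over the neighborhood. The function names and toy values are illustrative, and the final weighted sum of neighbor features is omitted.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gat_coefficients(feats, i, nbrs, W, a, slope=0.2):
    """Attention weights of target node i over the nodes in nbrs (one head)."""
    zi = matvec(W, feats[i])
    scores = []
    for j in nbrs:
        zj = matvec(W, feats[j])
        e = sum(ak * zk for ak, zk in zip(a, zi + zj))  # a^T [W h_i || W h_j]
        scores.append(e if e > 0 else slope * e)         # LeakyReLU
    # Softmax over the neighborhood, shifted by the max for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


feats = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.0, 1.0], 3: [2.0, 0.0]}
W = [[1.0, 0.0], [0.0, 1.0]]          # identity map, for readability
a = [0.5, -0.5, 1.0, 0.0]             # hypothetical attention vector
alpha = gat_coefficients(feats, 0, [1, 2, 3], W, a)
```

Neighbors 1 and 2 have identical features and therefore receive identical weights, while neighbor 3 scores higher under this particular `a`.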
MPNN. Gilmer and colleagues proposed MPNN as a unifying notation: each layer computes a message for each edge using the source feature, target feature, and edge feature, sums the messages into each target node, and then updates the target node with a recurrent or feed-forward function. Almost every later architecture can be written in MPNN form, and most graph libraries expose an MPNN class as the base abstraction.
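The MPNN recipe maps naturally onto a function that takes the message and update functions as arguments. The sketch below is an assumption-laden illustration: `mpnn_step`, the dict-based graph, and the specific `message` and `update` functions are invented for the example and do not correspond to any library's base class.

```python
def mpnn_step(h, edges, edge_feat, message, update, msg_dim):
    """One message passing step: compute a message per edge, sum per target, update."""
    inbox = {v: [0.0] * msg_dim for v in h}
    for u, v in edges:
        m = message(h[u], h[v], edge_feat[(u, v)])
        inbox[v] = [x + y for x, y in zip(inbox[v], m)]  # sum aggregation
    return {v: update(h[v], inbox[v]) for v in h}


def message(h_src, h_dst, e):
    # Illustrative choice: the scalar edge feature scales the source feature.
    return [e[0] * x for x in h_src]


def update(h_v, m_v):
    # Illustrative choice: residual add followed by ReLU.
    return [max(0.0, a + b) for a, b in zip(h_v, m_v)]


h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
edges = [(0, 2), (1, 2), (2, 0)]
edge_feat = {(0, 2): [0.5], (1, 2): [2.0], (2, 0): [1.0]}
h_next = mpnn_step(h, edges, edge_feat, message, update, msg_dim=2)
```

Swapping in a different `message` or `update` recovers different architectures, which is the sense in which the framework is unifying. Because messages are summed, the result does not depend on the order in which edges are listed.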
GIN. Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka analyzed GNN expressivity at ICLR 2019. They proved that any GNN based on neighborhood aggregation is at most as powerful as the 1-Weisfeiler-Lehman (1-WL) graph isomorphism test, that mean and max aggregators are strictly weaker than sum, and that an injective sum aggregator followed by an MLP (the Graph Isomorphism Network, or GIN) achieves the 1-WL bound. GIN became a standard baseline for graph classification, in part because its theoretical properties are matched by strong empirical performance on the TU datasets.
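To make the 1-WL bound concrete, here is a small sketch of color refinement. The function `wl_histogram` and the example graphs are assumptions for illustration; colors are kept as nested tuples rather than hashed, which is only practical for tiny graphs and a few rounds.

```python
from collections import Counter


def wl_histogram(adj, rounds=3):
    """Run 1-WL color refinement and return the multiset of final node colors."""
    colors = {v: () for v in adj}  # every node starts with the same color
    for _ in range(rounds):
        # New color = (own color, sorted multiset of neighbor colors).
        colors = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
    return Counter(colors.values())


hexagon = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
two_triangles = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
path4 = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
star4 = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}

same = wl_histogram(hexagon) == wl_histogram(two_triangles)   # not distinguished
different = wl_histogram(path4) != wl_histogram(star4)        # distinguished

# Why sum beats mean: neighbor multisets {1, 1} and {1} have the same mean.
mean_collides = (1 + 1) / 2 == 1 / 1
sum_separates = (1 + 1) != 1
```

A hexagon and two disjoint triangles are both 2-regular on six nodes, so refinement never separates them; a message passing GNN bounded by 1-WL assigns them the same graph-level representation.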
R-GCN and HAN. Real graphs often have typed nodes and edges. R-GCN, by Michael Schlichtkrull and collaborators in 2017, assigns a separate learnable weight matrix to each relation type and sums the per-relation messages. Basis decomposition or block diagonal decomposition controls the parameter count when the number of relations is large. HAN (Heterogeneous Attention Network) by Xiao Wang and colleagues at WWW 2019 generalizes attention along meta-paths, fixed sequences of relations that capture semantic patterns in a heterogeneous graph.
Scalability variants. Standard message passing requires the full neighbor set of each node at every layer, which blows up memory when the graph has hundreds of millions of edges. GraphSAINT samples a subgraph at each iteration (by nodes, edges, or random walks) and applies normalization to correct the sampling bias. Cluster-GCN partitions the graph with a graph clustering algorithm (METIS) and performs mini-batch training within each partition. SIGN precomputes multi-hop diffused features once and trains a feed-forward MLP on the result, trading expressivity for raw throughput.
Graph transformers apply the self-attention mechanism to all node pairs, not only to neighbors. They lose the inductive bias of locality but gain a global receptive field, which helps on tasks where long-range information matters (predicting properties of polymers, reading long chains in a parse tree). The challenge is how to inject the graph structure: without positional or structural information, self-attention is permutation equivariant over its inputs and would treat the graph as a bag of nodes.
Vijay Prakash Dwivedi and Xavier Bresson proposed the Graph Transformer in 2020, which adds Laplacian eigenvectors as positional encodings so the attention layer can distinguish nodes by their structural role. The Spectral Attention Network (SAN) by Devin Kreuzer and colleagues at NeurIPS 2021 extends this with learned positional encodings derived from the full eigendecomposition.
Graphormer, introduced by Chengxuan Ying, Tianle Cai, and collaborators at NeurIPS 2021, encodes graph structure through three structural encodings: a centrality encoding based on node degree that is added to the input node features, a spatial encoding based on the shortest path distance between nodes that is added as a bias to the attention logits, and an edge encoding aggregated along the shortest path that is also added as an attention bias. Graphormer won the graph-level track of the OGB Large Scale Challenge (OGB-LSC) on the PCQM4M quantum chemistry dataset in 2021, ahead of the message passing baselines.
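The spatial encoding can be sketched as a lookup from shortest path distance to a scalar bias that is added to each attention logit. The code below is a simplified illustration, not Graphormer's implementation: the function names, the clipping of long distances to the last table entry, and the separate bias for unreachable pairs are assumptions of this example, and in a real model the table entries would be learned.

```python
from collections import deque


def shortest_path_lengths(adj):
    """All-pairs unweighted shortest path lengths via breadth-first search."""
    dist = {}
    for s in adj:
        d = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in d:
                    d[w] = d[u] + 1
                    queue.append(w)
        dist[s] = d
    return dist


def biased_logits(logits, adj, bias_by_distance, unreachable_bias):
    """Add a distance-indexed scalar bias to every pairwise attention logit."""
    dist = shortest_path_lengths(adj)
    max_d = len(bias_by_distance) - 1
    out = {}
    for i in adj:
        out[i] = {}
        for j in adj:
            if j in dist[i]:
                bias = bias_by_distance[min(dist[i][j], max_d)]  # clip long paths
            else:
                bias = unreachable_bias
            out[i][j] = logits[i][j] + bias
    return out


# Path 0 - 1 - 2 plus an isolated node 3; raw logits are all zero.
adj = {0: {1}, 1: {0, 2}, 2: {1}, 3: set()}
logits = {i: {j: 0.0 for j in adj} for i in adj}
biased = biased_logits(logits, adj, bias_by_distance=[0.0, -0.5, -1.0], unreachable_bias=-9.0)
```

After the softmax, pairs that are far apart in the graph receive less attention under this table, which is how graph structure re-enters an otherwise all-pairs attention layer.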
GraphGPS by Ladislav Rampášek, Mikhail Galkin, and Dominique Beaini at NeurIPS 2022 proposed a recipe for hybrid models. Each block contains a local message passing layer in parallel with a global attention layer, with both fed by learned positional and structural encodings. GraphGPS gave consistent gains across the Long Range Graph Benchmark and inspired a wave of hybrid architectures.
Exphormer by Hamed Shirzad and collaborators at ICML 2023 reduced the quadratic attention cost using expander graphs as a sparse global connectivity pattern, making global attention tractable on graphs with tens of thousands of nodes. NAGphormer by Jinsong Chen and colleagues reformulated graph attention as a sequence problem over hop counts, building for each node a token sequence from its aggregated multi-hop neighborhoods, which allows the use of standard transformer libraries.
The most recent line of work applies state space models to graphs. Graph-Mamba by Chloe Wang and colleagues in 2024 adapts the Mamba selective state space layer to graph data by ordering and selecting node sequences so that structurally relevant nodes are prioritized. Initial results suggest sub-quadratic global mixing with performance competitive with GraphGPS on long-range tasks. Related work, including Graph Mamba Networks (GMN), has explored alternative scan orders such as breadth-first traversal and random walks.
Molecules and crystals carry geometry: each atom has a 3D position, and scalar properties of the system such as its energy are invariant under translation, rotation, and reflection of the whole structure, while vector quantities such as forces rotate with it. Plain GNNs that use only the graph topology lose this information. Equivariant graph neural networks preserve it by ensuring that if the input coordinates rotate, the output rotates in the same way.
The pioneer was SchNet by Kristof Schütt, Pieter-Jan Kindermans, and collaborators at NeurIPS 2017. SchNet uses continuous-filter convolutions parameterized by the interatomic distance, which makes the energy prediction translation, rotation, and permutation invariant. SchNet trained on the QM9 dataset reached chemical accuracy on several molecular properties for the first time with a neural network.
DimeNet by Johannes Gasteiger, Janek Groß, and Stephan Günnemann at ICLR 2020 added directional information through messages that depend on bond angles, not only on distances. DimeNet++ improved the speed mainly by replacing the costly bilinear interaction layer with a cheaper element-wise product and by shrinking the embedding sizes used in the directional message passing. GemNet by Gasteiger and colleagues at NeurIPS 2021 incorporated dihedral angles, capturing four-body interactions.
EGNN by Victor Garcia Satorras, Emiel Hoogeboom, and Max Welling at ICML 2021 introduced a simpler equivariant scheme that operates directly on coordinates without spherical harmonics, treating positions as features updated jointly with scalar features at each layer. NequIP by Simon Batzner and collaborators at Nature Communications 2022 used full SO(3) equivariant tensor products on top of e3nn and matched force-field accuracy with one or two orders of magnitude less training data. PaiNN by Schütt, Oliver Unke, and Michael Gastegger at ICML 2021 used a polarizable atom interaction scheme with vector-valued node features. MACE by Ilyes Batatia and collaborators at NeurIPS 2022 generalized NequIP using high-body-order equivariant features in a single layer. Allegro by Albert Musaelian and collaborators (Nature Communications 2023) introduced a strictly local equivariant model that scales to millions of atoms.
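The idea behind the EGNN coordinate update can be checked numerically. In the sketch below each atom moves along the difference vectors to the other atoms, scaled by a weight that depends only on rotation-invariant quantities (scalar features and squared distance). The fixed `weight_fn` stands in for the learned networks of the actual model, and all names and values are assumptions made for this illustration.

```python
import math


def egnn_coord_update(pos, h, weight_fn):
    """Move each atom along relative position vectors, scaled by invariant weights."""
    n = len(pos)
    new_pos = []
    for i in range(n):
        delta = [0.0, 0.0, 0.0]
        for j in range(n):
            if i == j:
                continue
            diff = [a - b for a, b in zip(pos[i], pos[j])]
            d2 = sum(c * c for c in diff)     # squared distance is rotation invariant
            w = weight_fn(h[i], h[j], d2)
            for k in range(3):
                delta[k] += diff[k] * w
        new_pos.append([p + d / (n - 1) for p, d in zip(pos[i], delta)])
    return new_pos


def rotate_z(pos, theta):
    """Rotate every point about the z axis by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c * x - s * y, s * x + c * y, z] for x, y, z in pos]


def weight_fn(h_i, h_j, d2):
    # Stand-in for a learned function of invariant inputs.
    return (h_i * h_j) / (1.0 + d2)


pos = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.5], [0.0, 2.0, -1.0]]
h = [1.0, 0.5, -2.0]
theta = 0.7
update_then_rotate = rotate_z(egnn_coord_update(pos, h, weight_fn), theta)
rotate_then_update = egnn_coord_update(rotate_z(pos, theta), h, weight_fn)
```

The two results agree up to floating-point error, which is the equivariance property: rotating the input and then updating gives the same positions as updating and then rotating.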
In November 2023, Google DeepMind released GNoME (Graph Networks for Materials Exploration), a GNN-based pipeline that predicted 2.2 million new crystal structures, 380 thousand of which were judged stable. The work, published in Nature, used a GNN ensemble for energy prediction combined with density functional theory verification. In January 2025, Microsoft Research released MatterGen, a diffusion model over crystal structures conditioned on target properties. The two systems are complementary: GNoME screens candidate structures by predicted stability, while MatterGen samples new candidates conditioned on desired properties.
In drug discovery the equivariant GNN community has also explored conformer prediction (GeoMol, GeoDiff), docking (DiffDock by Gabriele Corso and collaborators at ICLR 2023), and protein folding, where the Evoformer block at the heart of AlphaFold 2 uses triangle attention over a pair representation that is essentially a complete graph over residues. AlphaFold 3, announced in May 2024, builds on this pair-representation design and extends it to nucleic acids and small-molecule ligands.
A knowledge graph stores facts as triples of the form (head entity, relation, tail entity). Examples include Freebase, Wikidata, and large industrial knowledge graphs at Google and Amazon. The two main tasks are link prediction, also called knowledge graph completion, and entity classification. Knowledge graph embedding models map each entity and relation to a low-dimensional vector and score plausibility of a triple using a per-model formula.
| Model | Year | Scoring | Notable property |
|---|---|---|---|
| TransE | 2013 | Negative L1 or L2 norm of h plus r minus t | Translation in embedding space, cannot model symmetric relations |
| TransH | 2014 | Translation on relation hyperplane | Handles 1-to-N and N-to-1 |
| DistMult | 2015 | Bilinear with diagonal relation matrix | Symmetric relations only |
| ComplEx | 2016 | Bilinear in complex space | Models antisymmetric relations |
| RotatE | 2019 | Rotation in complex plane | Captures symmetry, antisymmetry, inversion, and composition |
| ConvE | 2018 | 2D convolution over reshaped embedding | Strong with limited parameters |
| R-GCN | 2017 | GNN per-relation message passing | Combines structure and embeddings |
| CompGCN | 2020 | Joint entity and relation embedding in GNN | Generalizes earlier knowledge graph methods |
| Query2Box | 2020 | Box embedding for complex queries | Supports first-order logic queries |
TransE, introduced by Antoine Bordes and colleagues at NeurIPS 2013, treats each relation as a translation vector and scores a triple by the distance between the translated head and the tail. DistMult by Bishan Yang and collaborators at ICLR 2015 replaced translation with a diagonal bilinear form. ComplEx by Théo Trouillon and collaborators at ICML 2016 lifted DistMult to complex numbers so antisymmetric relations (parent of, supervises) could be represented. RotatE by Zhiqing Sun and collaborators at ICLR 2019 modeled each relation as a rotation in complex space, which jointly captures symmetry, antisymmetry, inversion, and composition. ConvE applied a small 2D CNN to reshaped entity and relation embeddings and remains a popular baseline due to its parameter efficiency.
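The four scoring functions described above are short enough to write out directly. This is a minimal sketch using plain Python lists and built-in complex numbers; the function names and toy embeddings are assumptions for illustration, and training (negative sampling, loss, regularization) is not shown.

```python
import cmath
import math


def transe_score(h, r, t):
    """Negative L1 distance between the translated head (h + r) and the tail."""
    return -sum(abs(a + b - c) for a, b, c in zip(h, r, t))


def distmult_score(h, r, t):
    """Bilinear form with a diagonal relation matrix."""
    return sum(a * b * c for a, b, c in zip(h, r, t))


def complex_score(h, r, t):
    """Real part of the trilinear product with the conjugated tail."""
    return sum(a * b * c.conjugate() for a, b, c in zip(h, r, t)).real


def rotate_score(h, phases, t):
    """Negative distance after rotating each head coordinate by a phase angle."""
    return -sum(abs(a * cmath.exp(1j * p) - c) for a, p, c in zip(h, phases, t))


# TransE: a perfect translation scores zero, the best possible value.
perfect = transe_score([1.0, 2.0], [0.5, -1.0], [1.5, 1.0])

# DistMult is symmetric in head and tail.
hr, rr, tr = [1.0, 2.0], [0.5, -1.0], [3.0, 0.25]
dm_forward, dm_backward = distmult_score(hr, rr, tr), distmult_score(tr, rr, hr)

# ComplEx with a purely imaginary relation flips sign when head and tail swap.
hc, rc, tc = [1 + 2j, 0.5 - 1j], [1j, 2j], [2 - 1j, 1 + 1j]
cx_forward, cx_backward = complex_score(hc, rc, tc), complex_score(tc, rc, hc)

# RotatE: a rotation by pi models a symmetric relation (applying it twice is identity).
hrot, trot = [1 + 1j, 2 - 1j], [-1 - 1j, -2 + 1j]
rot_forward = rotate_score(hrot, [math.pi, math.pi], trot)
rot_backward = rotate_score(trot, [math.pi, math.pi], hrot)
```

The examples mirror the properties in the table: DistMult cannot tell direction, ComplEx can represent antisymmetry, and RotatE represents a symmetric relation with a half-turn rotation.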
R-GCN combines a knowledge graph embedding objective with GNN message passing, treating each relation type as a separate channel. CompGCN by Shikhar Vashishth and collaborators at ICLR 2020 generalized this further by jointly updating entity and relation embeddings through a single GNN. Query embedding methods extend the framework to multi-hop queries: Query2Box by Hongyu Ren and collaborators at ICLR 2020 embeds each query as a box in vector space, with conjunction implemented as box intersection and disjunction handled by rewriting the query into disjunctive normal form. Follow-ups such as BetaE expand the operator set to include negation.
Graph machine learning is in production at several large companies. The applications listed below have public technical write-ups or peer-reviewed papers.
The most visible application is protein structure prediction. AlphaFold 2 by DeepMind, published in Nature in July 2021, predicts the 3D structure of a protein from its amino acid sequence by treating the residues and their pairwise relations as a graph and using a custom attention mechanism called the Evoformer over both a multiple sequence alignment and a pair representation. AlphaFold 2 reached experimental accuracy on the CASP14 benchmark and triggered the release of the AlphaFold Protein Structure Database, which now contains predictions for more than 200 million proteins. RoseTTAFold by David Baker's lab at the University of Washington achieved similar results through a three-track architecture. AlphaFold 3, announced in May 2024 in a Nature paper, extends the model to predict structures of complexes including DNA, RNA, ligands, and post-translational modifications, using a diffusion module conditioned on a graph encoder.
In small molecule drug discovery, GNNs power molecular property prediction, molecular generation, and docking. Chemprop by Yang and colleagues at the Massachusetts Institute of Technology, used by Jonathan Stokes, Regina Barzilay, James Collins, and colleagues to discover the antibiotic halicin in 2020, is a directed MPNN trained on chemical libraries to predict antibacterial activity. DiffDock, presented at ICLR 2023, generates ligand poses with a diffusion model over translations, rotations, and torsions, with an equivariant GNN as the score network. GNoME for materials and MatterGen for generative materials design are described above.
PinSAGE by Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William Hamilton, and Jure Leskovec at KDD 2018 was the first GNN deployed at web scale. PinSAGE ran on a bipartite graph of 3 billion nodes (pins and boards) and 18 billion edges at Pinterest, generating embeddings for related pin recommendations through random walk sampling and importance pooling. The system replaced an existing collaborative filtering pipeline and was reported to improve recommendation quality by double-digit percentages.
Uber Eats described in 2019 a graph-based system for dish and restaurant recommendation. LinkedIn published several papers on heterogeneous GNNs for member-job matching. Alibaba's M2GRL uses multi-view GNNs over the Taobao product graph.
Google Maps switched its estimated time of arrival model to a GNN in 2020. Austin Derrow-Pinion, Jennifer She, David Wong, Oliver Lange, and collaborators described the system in a paper at CIKM 2021 and in Nature Machine Intelligence. The model treats road segments as nodes, with edges between segments that are consecutive on the same road or connected through an intersection, and predicts travel time with a GNN that aggregates spatial context from neighboring segments. Google reported reductions of up to 50 percent in ETA errors in several cities. Uber and DiDi have published similar systems.
Visa described in 2022 a transactional graph GNN that flags fraud rings by jointly considering accounts, devices, merchants, and IP addresses. PayPal, Stripe, Ant Group, and Tencent have published GNN-based anti-money-laundering pipelines. The advantage is the ability to detect coordinated patterns invisible from any single transaction.
Three major libraries dominate the field. All three implement the MPNN abstraction, ship hundreds of layers and datasets, and integrate with PyTorch, JAX, or TensorFlow.
| Library | Backend | First release | Maintainer | Notable feature |
|---|---|---|---|---|
| PyTorch Geometric (PyG) | PyTorch | 2019 | Matthias Fey, Jan Eric Lenssen (TU Dortmund, Kumo AI) | Largest model zoo, used in most academic papers |
| Deep Graph Library (DGL) | PyTorch, MXNet, TensorFlow | 2018 | Amazon AWS AI, NYU Shanghai, NYU | Distributed multi-machine training |
| jraph | JAX | 2020 | Google DeepMind | Functional, fast on TPU |
| Spektral | Keras, TensorFlow | 2019 | Daniele Grattarola | Easy Keras-style API |
| StellarGraph | TensorFlow | 2018 | CSIRO Data61 | End-to-end pipelines |
| TensorFlow GNN | TensorFlow | 2021 | Google | Production deployment in Google Cloud |
| TorchDrug | PyTorch | 2021 | MILA | Drug discovery focus |
PyTorch Geometric (PyG), introduced by Matthias Fey and Jan Eric Lenssen in 2019 at the ICLR Representation Learning on Graphs workshop, is the de facto standard for academic graph learning. PyG implements over 100 layers, ships standard datasets (OGB, TU, PPI, Reddit), and supports heterogeneous graphs, temporal graphs, and explainability tools. The library is built on a sparse tensor backend and integrates with the rest of the PyTorch ecosystem.
Deep Graph Library (DGL), released by AWS AI Lab and NYU in 2018, offers a similar feature set with stronger support for distributed training on multi-machine clusters. DGL exposes a relational message passing API convenient for heterogeneous graphs and supports several backends. The DGL team also maintains DGL-KE for large-scale knowledge graph embedding.
jraph by DeepMind, released in 2020, is a minimal functional library for graph nets in JAX. It is used by the AlphaFold team and other DeepMind researchers. Spektral by Daniele Grattarola is a Keras-based library aimed at fast prototyping. StellarGraph by CSIRO Data61 offers end-to-end pipelines. TensorFlow GNN (TF-GNN), released by Google in 2021, exposes a heterogeneous graph schema and integrates with TensorFlow Extended.
Graph learning benchmarks have evolved rapidly because early datasets (Cora, CiteSeer, PubMed) saturated quickly and were criticized for being too small and too easy. Several modern benchmark suites address these issues.
| Benchmark | Released | Scope | Notable property |
|---|---|---|---|
| TUDatasets | 2014 to 2020 | 120+ molecule, social, and biological graphs | Standard for graph classification |
| Cora, CiteSeer, PubMed | Pre-2010 | Citation networks | Classic transductive node classification |
| MoleculeNet | 2018 | Quantum, physiology, biophysics tasks | Pioneering chemistry benchmark |
| QM9 | 2014 | 134k small molecules, 12 quantum properties | Workhorse for equivariant networks |
| OGB | 2020 | Node, link, and graph tasks of various scales | Standardized splits, leaderboards |
| OGB-LSC | 2021 | KDD Cup 2021 large-scale tasks | PCQM4M, MAG240M, WikiKG90M |
| LRGB | 2022 | Long Range Graph Benchmark | Tests global mixing capability |
| GNN Benchmark suite | 2020 | Six tasks for fair comparison | Curated by Dwivedi and Bresson |
| Open Catalyst | 2020 | Catalyst materials simulation | 130 million DFT calculations |
| MalNet | 2021 | 1.2M function call graphs | Malware classification |
Open Graph Benchmark (OGB), introduced by Weihua Hu, Matthias Fey, Marinka Zitnik, and Jure Leskovec at NeurIPS 2020, defined standardized splits and metrics across graph sizes from small molecules to 100-million-edge citation graphs. OGB-LSC, released in 2021, scaled up to MAG240M (240 million nodes), WikiKG90M (90 million entities), and PCQM4M (4 million molecules). The OGB Large Scale Challenge at KDD Cup 2021 was won by Microsoft's Graphormer team on PCQM4M.
MoleculeNet by Zhenqin Wu, Bharath Ramsundar, and Vijay Pande in 2018 packaged 17 chemistry datasets covering quantum mechanics, physical chemistry, biophysics, and physiology, with scaffold splits that approximate generalization to new chemical scaffolds. LRGB (Long Range Graph Benchmark) by Dwivedi and colleagues in 2022 selected five datasets where information must propagate across many hops. LRGB is widely used to evaluate graph transformer architectures.
Despite the rapid progress, several fundamental issues constrain GNN performance and have driven much of the research agenda in the last five years.
Over-smoothing. Qimai Li, Zhichao Han, and Xiao-Ming Wu observed in 2018 that stacking many GCN layers causes node representations to converge to indistinguishable vectors, since repeated averaging acts like a low-pass filter on the Laplacian. The practical result is that most GCNs use only two or three layers, which limits the receptive field. Mitigations include residual connections (DeepGCN by Guohao Li and colleagues in 2019), PairNorm normalization, and the use of attention or gating to control diffusion.
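The low-pass effect is easy to reproduce on a toy graph. The sketch below repeatedly applies a plain mean over each node and its neighbors, with no weights or nonlinearity, and tracks the gap between the largest and smallest node value. Using a simple mean rather than GCN's symmetric normalization is an assumption made to keep the example short, and the function names are illustrative.

```python
def mean_propagate(adj, x):
    """One parameter-free smoothing step: average each node with its neighbors."""
    return {v: (x[v] + sum(x[u] for u in adj[v])) / (1 + len(adj[v])) for v in adj}


def spread_after(adj, x, layers):
    """Gap between the largest and smallest node value after repeated smoothing."""
    for _ in range(layers):
        x = mean_propagate(adj, x)
    return max(x.values()) - min(x.values())


# Path graph 0 - 1 - 2 - 3 - 4 with a one-hot signal on node 0.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
x0 = {0: 1.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}

shallow = spread_after(adj, x0, layers=2)
deep = spread_after(adj, x0, layers=100)
```

After two steps the nodes are still clearly distinguishable, while after a hundred steps their values have collapsed to nearly the same number, leaving a classifier nothing to separate.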
Over-squashing. Uri Alon and Eran Yahav showed at ICLR 2021 that the exponential growth of the receptive field across layers combined with a fixed-size node representation forces the network to compress information from many distant nodes into a single vector, losing information about long-range dependencies. They demonstrated that architectures which weight all incoming edges equally, such as GCN and GIN, suffer more than GAT and gated variants on synthetic tasks that need to combine information across many hops. Jake Topping and colleagues at ICLR 2022 connected over-squashing to negative curvature in the underlying graph and proposed structural rewiring to alleviate it.
Expressivity bounds. The Xu et al. analysis at ICLR 2019 proved that any standard message passing GNN is at most as powerful as the 1-Weisfeiler-Lehman test, meaning there exist non-isomorphic graphs that no GCN, GraphSAGE, or GIN can distinguish. Several stronger frameworks have been proposed, including k-GNN (which simulates k-WL at cost exponential in k), Provably Powerful Graph Networks by Maron and colleagues, and identity-aware GNNs (ID-GNN) by You and colleagues at AAAI 2021. Subgraph-based approaches such as ESAN by Bevilacqua and collaborators at ICLR 2022 also push past the 1-WL barrier.
Scalability. Even with sampling, graphs over 100 million nodes remain difficult to train on. Industrial systems at Amazon and Alibaba rely on heavy engineering: custom sampling kernels, multi-GPU partitioning, and offline neighbor precomputation. Distributed training is harder than for images or text because of irregular memory access and neighborhood dependency structure.
Heterophily. Most early benchmarks were homophilous: connected nodes tend to share the same label. On heterophilous graphs (where neighbors usually disagree), a basic GCN can underperform a simple MLP that ignores the graph entirely. Several architectures, including H2GCN by Zhu and colleagues at NeurIPS 2020 and GPR-GNN by Chien and colleagues, address heterophily by learning signed or per-hop coefficients.
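One common way to quantify where a dataset falls on this spectrum is the edge homophily ratio, the fraction of edges whose endpoints share a label. The helper below is a minimal sketch; the function name and the toy graph are assumptions for illustration.

```python
def edge_homophily(edges, labels):
    """Fraction of edges that connect two nodes with the same label."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


# Four-node cycle where labels alternate in pairs.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
labels = {0: "a", 1: "a", 2: "b", 3: "b"}
ratio = edge_homophily(edges, labels)
```

Values near 1 indicate a homophilous graph where neighborhood averaging helps, while values near 0 indicate a heterophilous graph where averaging over neighbors tends to blur the class signal.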
Robustness. GNNs are sensitive to graph structure perturbations. A handful of strategic edge additions can change a node's predicted class. Daniel Zügner and Stephan Günnemann at KDD 2018 introduced Nettack, the first targeted adversarial attack on GNNs.