"A graph neural network is a machine that reads a graph and learns by listening to its neighbors - one round of message passing at a time, until every node knows not just what it is, but where it stands in the web of relationships that defines it."
Overview
Graph Neural Networks (GNNs) are deep learning architectures designed for graph-structured data - the most general and pervasive data structure in science, engineering, and AI. Social networks, molecules, knowledge bases, protein interaction maps, citation graphs, program abstract syntax trees, and even the attention pattern of a transformer are all graphs. The central question GNNs answer is: how do we learn representations of nodes, edges, and graphs that are sensitive to their structural context?
The key insight, formalized by Gilmer et al. (2017) as the Message Passing Neural Network (MPNN) framework, is that useful graph representations can be built iteratively: in each round, every node collects information from its immediate neighbors, transforms that information with a learned function, and updates its own representation. After $T$ rounds, every node's representation reflects the structure of its $T$-hop neighborhood. This is the graph analogue of a convolutional layer in an image network, but operating on irregular, variable-size neighborhoods rather than fixed grids.
This section develops GNN theory from first principles. We begin with the formal setup (permutation invariance, task types), develop the MPNN framework as the unifying abstraction, and then study the major GNN families: Graph Convolutional Networks (Kipf & Welling, 2017), GraphSAGE (Hamilton et al., 2017), Graph Attention Networks (Velickovic et al., 2018), and Graph Isomorphism Networks (Xu et al., 2019). We then turn to two fundamental theoretical questions - expressiveness (what structural properties can a GNN detect?) and depth pathologies (what goes wrong as GNNs get deeper?) - and connect these to practical remedies. We close with Graph Transformers, training at scale, and a survey of modern applications including AlphaFold 2, knowledge graph reasoning, and LLM+GNN architectures.
Students who complete this section will be able to implement and reason about all major GNN architectures, understand the Weisfeiler-Leman expressiveness hierarchy, diagnose and fix over-smoothing and over-squashing, and read current GNN research papers independently.
Prerequisites
- Graph definitions: $G = (V, E)$, vertex degrees, paths, connectivity - 11-01 Graph Basics
- Adjacency matrix $A$, degree matrix $D$, Laplacian $L = D - A$ - 11-02 Graph Representations
- BFS and $k$-hop neighborhoods - 11-03 Graph Algorithms
- Spectral graph theory: normalized Laplacian, Chebyshev approximation, GCN spectral derivation, over-smoothing preview - 11-04 Spectral Graph Theory
- Eigenvalues, PSD matrices, the spectral theorem - 03-01 Eigenvalues and Eigenvectors
- Attention mechanisms and transformer architecture - 14 Math for Specific Models
- Gradient descent and backpropagation - 08 Optimization
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Interactive derivations: GCN propagation, GAT attention visualization, WL color refinement, over-smoothing dynamics, graph pooling, neighbor sampling |
| exercises.ipynb | 8 graded exercises from MPNN implementation through WL expressiveness, over-smoothing diagnosis, and molecular property prediction |
Learning Objectives
After completing this section, you will:
- Implement the MPNN framework and express GCN, GraphSAGE, GAT, and GIN as special cases
- Derive the GCN layer from the spectral approximation and interpret it spatially
- Define permutation invariance and equivariance and explain why GNNs must satisfy these properties
- Implement neighbor sampling (GraphSAGE), attention-weighted aggregation (GAT), and multi-head graph attention (GATv2)
- State and prove the GNN expressiveness theorem (Xu et al., 2019): MPNNs are at most as powerful as 1-WL
- Implement the Weisfeiler-Leman color refinement algorithm and construct graphs that WL cannot distinguish
- Construct GIN and explain why sum aggregation is strictly more powerful than mean or max
- Quantify over-smoothing via Dirichlet energy decay and over-squashing via Jacobian analysis
- Apply DropEdge, PairNorm, residual connections, and graph rewiring to mitigate depth pathologies
- Describe the GPS framework and distinguish Laplacian PE, RWSE, and degree encoding for graph Transformers
- Explain how AlphaFold 2, PinSage, and Graph RAG (Microsoft 2024) use GNN mathematics in production
Table of Contents
- 1. Intuition
- 2. Formal Setup: Graphs as Data
- 3. The MPNN Framework
- 4. Graph Convolutional Networks (GCN)
- 5. GraphSAGE: Inductive Learning on Large Graphs
- 6. Graph Attention Networks (GAT)
- 7. Expressiveness and the Weisfeiler-Leman Test
- 8. Over-Smoothing, Over-Squashing, and Depth
- 9. Graph Pooling and Hierarchical GNNs
- 10. Graph Transformers
- 11. Training and Scaling GNNs
- 12. Applications in Machine Learning
- 13. Common Mistakes
- 14. Exercises
- 15. Why This Matters for AI (2026 Perspective)
- 16. Conceptual Bridge
1. Intuition
1.1 The Core Challenge: Learning on Non-Euclidean Data
Standard deep learning architectures - convolutional networks, recurrent networks, transformers - were designed for data that lives on regular, well-ordered domains. An image is a grid: every pixel has exactly the same number of neighbors, arranged in the same spatial pattern. A sentence is a sequence: every token has a left neighbor and a right neighbor, and positions are canonically ordered. These regular structures make convolution and positional encodings natural.
Real-world data is rarely so obliging. A molecule has atoms connected by bonds in a pattern determined by chemistry, not a grid. A social network has users connected by friendships in patterns determined by behavior and geography. A knowledge base has concepts connected by relations in patterns determined by the world's semantic structure. A protein's function is determined by the 3D arrangement of amino acid residues, which a graph can represent but a grid cannot. These are non-Euclidean data structures: they have no natural notion of translation, no canonical coordinate system, no uniform neighborhood size.
The challenge GNNs solve is: how do we define a learnable function on a graph that generalizes across different graphs, respects the graph's structure, and can be trained end-to-end by gradient descent? This is harder than it sounds. The same function applied to two isomorphic graphs (graphs with the same structure but differently labeled vertices) should produce the same output. The function must handle graphs of arbitrary size and arbitrary vertex degrees. And it must be expressive enough to detect structural patterns (triangles, paths of length $k$, specific subgraph motifs) that determine the graph's properties.
For AI: In 2026, GNNs are production components in AlphaFold 2's structure module (atoms as nodes, spatial edges), PinSage (users and items as nodes, interactions as edges, in a graph of roughly 3 billion nodes), and Microsoft's Graph RAG system (entities and relations from documents as knowledge graphs for retrieval-augmented generation). Understanding GNNs is not optional for anyone working with structured, relational, or molecular data.
1.2 The Message-Passing Insight
The insight that makes GNNs possible is conceptually simple: a node's representation should depend on the representations of its neighbors, and this dependency should be computed iteratively.
Consider a social network node - a person. Their social identity depends on whom they know. But whom they know also depends on whom those people know, and so on. The first round of "listening to neighbors" gives each person a sense of their immediate social circle. The second round gives a sense of friends-of-friends. After $k$ rounds, each node's representation encodes the structure of its $k$-hop neighborhood.
This is message passing: in each layer, every node sends a message to its neighbors, every node aggregates the messages it receives, and every node updates its representation based on the aggregated messages. Formally, for node $v$ at layer $t$:

$$h_v^{(t+1)} = \text{UPDATE}\Big(h_v^{(t)},\; \text{AGGREGATE}\big(\{\, h_u^{(t)} : u \in \mathcal{N}(v) \,\}\big)\Big)$$

where $\mathcal{N}(v)$ is the set of neighbors of $v$, $h_v^{(t)}$ is node $v$'s representation at layer $t$, and $h_v^{(0)} = x_v$ is the initial node feature.
The connection to BFS: Breadth-first search from node $v$ reaches, at step $k$, exactly the nodes at distance $k$ from $v$. A $k$-layer GNN computes a representation that depends on exactly the nodes reachable in $k$ steps. BFS is the non-learned, binary-valued version of message passing: it answers "which nodes are in the $k$-hop neighborhood?" whereas a GNN answers "what is the learned summary of the $k$-hop neighborhood?"
For AI: This message-passing structure is the graph analogue of the convolutional receptive field. A pixel in a CNN "sees" a fixed grid-shaped patch after $k$ convolutional layers. A node in a GNN "sees" its $k$-hop neighborhood after $k$ message-passing layers. The critical difference: the CNN's receptive field grows quadratically in $k$ (for a 2D grid), while the GNN's receptive field can grow exponentially in $k$ (for a well-connected graph), which creates both expressiveness and scalability challenges we will study in Section 8.
1.3 Two Perspectives: Spectral vs Spatial
Graph convolution was first defined in the spectral domain. The key idea from 11-04 is that the graph Laplacian admits an eigendecomposition $L = U \Lambda U^\top$, and the columns of $U$ form a graph Fourier basis. A graph signal $x \in \mathbb{R}^n$ (one scalar per node) is filtered by applying a function $g_\theta$ to its Fourier coefficients:

$$g_\theta \star x = U\, g_\theta(\Lambda)\, U^\top x$$

where $g_\theta(\Lambda)$ is a diagonal matrix of filter coefficients. Full spectral convolution requires computing the full eigendecomposition - $O(n^3)$ cost, infeasible for large graphs.
Recall from 11-04 Section 7.5: Kipf & Welling (2017) derived the GCN layer as a first-order Chebyshev polynomial approximation to spectral filtering, further simplified by setting $\theta = \theta_0 = -\theta_1$ and adding self-loops. This gives the propagation rule $H^{(l+1)} = \sigma\big(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\big)$ - see 11-04 Section 7.5 for the full derivation.
Once we arrive at the GCN layer, the spectral scaffolding can be discarded and the layer reinterpreted spatially: it is simply normalized aggregation of neighbor features followed by a linear transformation and nonlinearity. This spatial reinterpretation is what makes GNNs extensible: one can design new aggregation functions, learn attention weights, incorporate edge features, and handle directed graphs - none of which fit neatly into the spectral framework.
The two perspectives are complementary:
- Spectral: principled derivation, frequency interpretation, connection to graph signal processing; limited to fixed graphs, requires eigendecomposition or polynomial approximation
- Spatial: flexible, scalable, inductive, handles heterogeneous graphs; less theoretically constrained, harder to analyze convergence
Modern GNN research lives primarily in the spatial perspective, with spectral theory providing theoretical grounding and positional encodings (Section 10).
1.4 What Makes Graphs Different from Other Data
Four properties of graph-structured data require fundamentally different architectural choices:
1. Variable and irregular neighborhoods. A node in a molecule may have 1 bond (hydrogen) or 4 bonds (carbon). A node in a social network may have 3 friends or 3,000. Standard convolution assumes a fixed-size, fixed-pattern receptive field. GNNs must aggregate over neighborhoods of arbitrary size - which forces a choice of aggregation function (sum, mean, max, attention) that is insensitive to neighborhood size.
2. No canonical node ordering. An image has a canonical pixel ordering (row-major). A sentence has a canonical word ordering (left-to-right). A graph has no such ordering. If we relabel the nodes of a graph, we get the same graph - just with different indices. This means a GNN must be permutation equivariant (the output node representations permute consistently with any relabeling of input nodes) and any graph-level prediction must be permutation invariant (the same under any relabeling). This rules out any architecture that depends on node indices.
3. Structural information is in the topology. Two molecules with identical atom composition but different bond structures (structural isomers) have very different chemical properties. The GNN must detect this topological difference from the adjacency structure alone. This is the expressiveness question of Section 7: which topological patterns can a GNN distinguish?
4. Heterogeneous graphs. Real graphs often have multiple types of nodes and edges (knowledge graphs: Person --worksAt--> Company, Person --bornIn--> City). Relational GNNs (R-GCN) and heterogeneous GNNs handle this with type-specific transformation matrices. This section focuses on homogeneous graphs; heterogeneous extensions are discussed in Section 12.2.
1.5 Historical Timeline (2013-2026)
GRAPH NEURAL NETWORK HISTORICAL TIMELINE
========================================================================
2009 Scarselli et al. - Original GNN: fixed-point iteration on graphs
2013 Bruna et al. - Spectral graph convolution (full eigendecomp.)
2016 Defferrard et al. - ChebNet: Chebyshev polynomial spectral filters
2016 Li et al. - Gated GNN: GRU-based update functions
2017 Kipf & Welling - GCN: simplified spectral GNN for node classification
2017 Hamilton et al. - GraphSAGE: inductive learning with neighbor sampling
2017 Gilmer et al. - MPNN: unified message-passing framework
2018 Velickovic et al. - GAT: attention-weighted neighborhood aggregation
2018 Ying et al. - DiffPool: differentiable hierarchical pooling
2019 Xu et al. - GIN + WL expressiveness theorem
2019 Chiang et al. - Cluster-GCN: scalable training by graph partitioning
2020 Rong et al. - DropEdge: data augmentation against over-smoothing
2021 Ying et al. - Graphormer: transformer for graphs (OGB-LSC winner)
2021 Brody et al. - GATv2: fixed dynamic attention for GAT
2022 Rampasek et al. - GPS: general, powerful, scalable graph transformer
2023 Chen et al. - Graph Mamba: state-space models on graphs
2024 Microsoft - Graph RAG: GNNs for retrieval-augmented generation
2024-2026 - LLM+GNN: language models with graph-structured memory
========================================================================
2. Formal Setup: Graphs as Data
2.1 Graph with Node and Edge Features
A featured graph is a tuple $G = (V, E, X, X^e)$ where:
- $V = \{1, \dots, n\}$ - the vertex set, $n$ nodes
- $E \subseteq V \times V$ - the edge set, $m$ edges
- $X \in \mathbb{R}^{n \times d}$ - node feature matrix: row $x_v$ is the feature vector of node $v$
- $X^e \in \mathbb{R}^{m \times d_e}$ - edge feature matrix: row $e_{uv}$ is the feature of edge $(u, v)$
The adjacency structure is encoded either as a dense matrix $A \in \{0, 1\}^{n \times n}$ or (for large sparse graphs) as an edge list - a pair of index tensors $(\text{src}, \text{dst})$ where $\text{src}[k]$ is the source and $\text{dst}[k]$ is the target of edge $k$.
Examples:
| Domain | Node features | Edge features | Task |
|---|---|---|---|
| Molecular graph | Atom type (one-hot), charge, hybridization | Bond type (single/double/aromatic), distance | Predict solubility |
| Social network | User demographics, activity history | Interaction frequency, timestamp | Link prediction |
| Knowledge graph | Entity type embedding | Relation type embedding | Triple completion |
| Citation network | Bag-of-words of paper abstract | None (unweighted) | Classify research area |
| Code AST | Token type, identifier embedding | Syntactic relation type | Bug detection |
Non-examples of graph data:
- An image stored as a grid: regular structure makes CNN superior; GNNs are overkill and less efficient
- A flat table of independent samples: no edges, no relational structure; standard ML applies
- A time series: sequential structure with canonical ordering; RNN/Transformer preferred unless interactions are truly non-sequential
2.2 Learning Tasks on Graphs
GNNs support three levels of prediction:
Node-level tasks: predict a label for each node (or a subset). The GNN produces a node embedding $h_v$ for every node, and a final classifier $f$ maps it to a prediction $\hat{y}_v = f(h_v)$.
- Semi-supervised node classification: most nodes are unlabeled; a small fraction have known labels. The GNN propagates information from labeled to unlabeled nodes through the graph structure. Classic example: Cora/CiteSeer citation networks (Kipf & Welling, 2017).
- Node regression: predict a continuous value per node. Example: predicting traffic speed at road intersections.
Edge-level tasks: predict a property of a pair of nodes $(u, v)$, whether or not the edge exists.
- Link prediction: does edge $(u, v)$ exist? Used in recommendation (predict user-item interaction) and knowledge graph completion (predict missing relations). The GNN produces $h_u$ and $h_v$; the prediction is $\hat{y}_{uv} = g(h_u, h_v)$ (e.g., inner product or MLP).
- Edge classification: predict the type of an existing edge. Example: classify protein-protein interaction type (physical, genetic, etc.).
Graph-level tasks: predict a single label for an entire graph $G$.
- Graph classification: classify molecular graphs as toxic/non-toxic; classify citation subgraphs by topic. Requires a readout function that aggregates all node embeddings into a single graph embedding $h_G$.
- Graph regression: predict a scalar property of the graph (e.g., binding affinity, quantum energy). Most molecular property prediction benchmarks (OGB-molhiv, QM9) are graph regression tasks.
2.3 Permutation Invariance and Equivariance
Definition (Permutation Equivariance). Let $\Pi_n$ be the group of $n \times n$ permutation matrices. A function $f(X, A)$ mapping node features and adjacency to per-node outputs is permutation equivariant if for every permutation matrix $P \in \Pi_n$:

$$f\big(PX,\; PAP^\top\big) = P\, f(X, A)$$
That is, permuting the node ordering in the input permutes the output representations in the same way. Node-level GNNs must be permutation equivariant: if we relabel the nodes, the node embeddings should permute consistently.
Definition (Permutation Invariance). A function $f(X, A)$ mapping a graph to a single output is permutation invariant if for every permutation matrix $P \in \Pi_n$:

$$f\big(PX,\; PAP^\top\big) = f(X, A)$$
Graph-level GNNs must be permutation invariant: the prediction for a graph is the same regardless of how we number its vertices.
Why this constrains architecture. Consider a simple approach: concatenate all node features into one vector $[x_1 \| x_2 \| \cdots \| x_n]$ and pass it through an MLP. This is not permutation equivariant - shuffling the nodes changes the input vector and hence the output. Similarly, any function of the adjacency matrix that reads specific row/column indices (like the $A_{12}$ entry) is not permutation invariant.
The only permutation-equivariant functions that aggregate neighborhoods are those that apply a fixed function to the set of neighbor features - which is precisely what GNN aggregation functions do. Functions of sets must be insensitive to the order of elements; sum, mean, and max all satisfy this; concatenation does not.
Formal example. Let $f(X, A)_v = \sigma\big(\sum_{u \in \mathcal{N}(v)} W x_u\big)$ (GCN-style). This is equivariant because the sum runs over a set - reordering the neighbors does not change it. But $g(X, A)_v = \sigma\big(W [x_{u_1} \| x_{u_2} \| \cdots]\big)$ for an ordered list of neighbors $u_1, u_2, \dots$ is not equivariant.
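The equivariance claim is easy to verify numerically. A minimal PyTorch sketch (the random graph, weights, and function name are illustrative):

```python
import torch

torch.manual_seed(0)
n, d = 5, 3
X = torch.randn(n, d)
A = (torch.rand(n, n) < 0.4).float()
A = torch.maximum(A, A.T)              # symmetric adjacency
W = torch.randn(d, d)

def gcn_style(X, A, W):
    """Sum-aggregate neighbor features, then transform: relu(A @ X @ W)."""
    return torch.relu(A @ X @ W)

P = torch.eye(n)[torch.randperm(n)]    # random permutation matrix
lhs = gcn_style(P @ X, P @ A @ P.T, W) # permute the input graph
rhs = P @ gcn_style(X, A, W)           # permute the output instead
print(torch.allclose(lhs, rhs))        # True: outputs permute with inputs
```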
2.4 Representations: From Single Graph to Batched Graphs
Training GNNs on datasets of multiple small graphs (graph classification, molecular property prediction) requires efficient batching. Unlike image batches where tensors have uniform shape, graphs have variable numbers of nodes and edges.
The block-diagonal batch trick. Given graphs $G_1, \dots, G_B$ with $n_1, \dots, n_B$ nodes, form a single disconnected "super-graph" with $N = \sum_i n_i$ nodes:

$$A_{\text{batch}} = \text{diag}\big(A_1, A_2, \dots, A_B\big)$$

The adjacency matrix is block-diagonal (no edges between different graphs). Since graphs in the batch are disconnected, message passing does not cross graph boundaries. A batch index vector $b \in \{1, \dots, B\}^N$ tracks which graph each node belongs to, enabling graph-level readout by aggregating within each block.
Virtual node trick. Adding a "super node" connected to all real nodes in a graph is a common trick to enable long-range information propagation without increasing depth. The virtual node aggregates the entire graph's information in one hop, then broadcasts it back. Used in MPNN (Gilmer et al., 2017) and Graphormer (Ying et al., 2021). It also serves as a form of global pooling: the virtual node's final representation $h_{\text{virtual}}^{(T)}$ can be read out as the graph embedding $h_G$.
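A minimal sketch of the block-diagonal batch trick, assuming each graph is stored as a feature matrix plus an edge-index tensor (helper names are ours):

```python
import torch

def make_batch(graphs):
    """Stack feature matrices and offset edge indices into one block-diagonal graph.
    `graphs` is a list of (X, edge_index) pairs: X is [n_i, d], edge_index is [2, m_i]."""
    xs, edges, batch, offset = [], [], [], 0
    for g_id, (X, edge_index) in enumerate(graphs):
        xs.append(X)
        edges.append(edge_index + offset)   # shift node ids into the super-graph
        batch.append(torch.full((X.size(0),), g_id, dtype=torch.long))
        offset += X.size(0)
    return torch.cat(xs), torch.cat(edges, dim=1), torch.cat(batch)

def mean_readout(H, batch, num_graphs):
    """Graph-level mean pooling driven by the batch index vector."""
    sums = torch.zeros(num_graphs, H.size(1)).index_add_(0, batch, H)
    counts = torch.bincount(batch, minlength=num_graphs).clamp(min=1).unsqueeze(1)
    return sums / counts
```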
3. The MPNN Framework
3.1 The Gilmer et al. (2017) Formulation
The Message Passing Neural Network (MPNN) framework, introduced by Gilmer et al. (2017) for quantum chemistry property prediction, provides the canonical mathematical abstraction for GNNs. It unifies GCN, GraphSAGE, GAT, GIN, and most other spatial GNN architectures as special cases.
Algorithm: Message Passing Phase

For $t = 0, 1, \dots, T - 1$ (layers):

$$m_v^{(t+1)} = \sum_{u \in \mathcal{N}(v)} M_t\big(h_v^{(t)}, h_u^{(t)}, e_{uv}\big), \qquad h_v^{(t+1)} = U_t\big(h_v^{(t)}, m_v^{(t+1)}\big)$$

Readout Phase (for graph-level tasks):

$$\hat{y} = R\big(\{\, h_v^{(T)} : v \in V \,\}\big)$$
Here:
- $M_t$ - message function at layer $t$; takes the central node's features, a neighbor's features, and the edge features, and returns a message vector
- $U_t$ - update function at layer $t$; combines the node's current representation with the aggregated message
- $R$ - readout function; maps a set of node representations to a graph-level prediction
The aggregation in the message phase is written as a sum, but can be replaced by any permutation-invariant function (mean, max, attention). The sum formulation is canonical because it is the most expressive (see Section 7).
Initialization: $h_v^{(0)} = x_v$ - the initial representation is the input node feature.
For AI: The MPNN framework is so general that it also subsumes certain attention mechanisms. The transformer's self-attention layer can be viewed as a fully-connected MPNN (every token attends to every other token) with the attention-weighted value $\alpha_{vu} (W_V h_u)$ as the message function.
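The following sketch instantiates one message-passing round in PyTorch, pairing an MLP message function with a GRU update (Section 3.4); class and argument names are ours:

```python
import torch
import torch.nn as nn

class MPNNLayer(nn.Module):
    """One message-passing round: message -> sum-aggregate -> update.
    A minimal sketch of the Gilmer et al. abstraction (without edge features)."""
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())  # M_t
        self.upd = nn.GRUCell(dim, dim)                               # U_t

    def forward(self, h, edge_index):
        src, dst = edge_index                     # each column is an edge src -> dst
        messages = self.msg(torch.cat([h[dst], h[src]], dim=-1))
        agg = torch.zeros_like(h).index_add_(0, dst, messages)  # sum over in-neighbors
        return self.upd(agg, h)                   # GRU update per node
```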
3.2 Message Functions
The message function determines what information flows along each edge. Several common designs:
Edge-independent messages. Many simple GNNs (GCN, GIN) ignore edge features and use:

$$M\big(h_v, h_u, e_{uv}\big) = h_u \qquad \text{or} \qquad M\big(h_v, h_u, e_{uv}\big) = W h_u$$
Edge-conditioned messages. When edge features are available:

$$M\big(h_v, h_u, e_{uv}\big) = \text{MLP}\big([h_u \,\|\, e_{uv}]\big)$$

where the MLP maps the concatenated neighbor and edge features to a message vector. Used in MPNN for molecular graphs (Gilmer et al., 2017): edge features encode bond type, bond length, and stereochemistry.
Pair messages. In models like NNConv, the edge feature defines the transformation matrix:

$$M\big(h_v, h_u, e_{uv}\big) = f_e\big(e_{uv}\big)\, h_u, \qquad f_e\big(e_{uv}\big) \in \mathbb{R}^{d \times d}$$

where $f_e$ is an MLP that produces a weight matrix from the edge features. This is extremely expressive but quadratically expensive in $d$.
3.3 Aggregation: Permutation-Invariant Pooling
The aggregation function must be permutation invariant over the multiset $\{h_u : u \in \mathcal{N}(v)\}$. Three canonical choices and their properties:
Sum aggregation:

$$m_v = \sum_{u \in \mathcal{N}(v)} h_u$$

Sensitive to neighborhood size - a node with 10 neighbors gets a larger aggregate than a node with 2, even if their neighborhood compositions are identical. This turns out to be a feature, not a bug: sum aggregation is the most expressive (Section 7.4). Used in GIN (Xu et al., 2019).
Mean aggregation:

$$m_v = \frac{1}{|\mathcal{N}(v)|} \sum_{u \in \mathcal{N}(v)} h_u$$
Normalizes by neighborhood size. Good when neighborhood size is not informative and you want to compare average neighbor characteristics across nodes of different degrees. Used in GCN (Kipf & Welling, 2017) and GraphSAGE (mean variant).
Max aggregation:

$$m_v = \max_{u \in \mathcal{N}(v)} h_u \quad (\text{element-wise})$$
Detects whether any neighbor has feature above a threshold. Good for detecting the presence of particular node types in the neighborhood. Used in GraphSAGE (max-pool variant).
Expressiveness hierarchy. Xu et al. (2019) proved a strict ordering: sum $\succ$ mean $\succ$ max in multiset expressive power. Mean cannot distinguish two neighborhoods whose feature multisets are proportional (e.g., $\{x\}$ vs $\{x, x\}$); max cannot count copies of identical features; sum distinguishes both.
Attention aggregation. Instead of fixed weights, learn attention coefficients:

$$m_v = \sum_{u \in \mathcal{N}(v)} \alpha_{vu}\, h_u$$

where $\alpha_{vu}$ is a learned attention score. This is the GAT model (Section 6).
3.4 Update Functions
The update function combines the node's current representation with the aggregated neighborhood message.
Linear update (GCN):

$$h_v^{(t+1)} = \sigma\big(W^{(t)} m_v^{(t+1)}\big)$$

The current node representation is absorbed into the message via the self-loop in $\tilde{A}$.
Concatenation update (GraphSAGE):

$$h_v^{(t+1)} = \sigma\big(W^{(t)} \big[h_v^{(t)} \,\|\, m_v^{(t+1)}\big]\big)$$

The node's current representation is concatenated with the aggregated neighbor message before transforming, which preserves the node's own information explicitly.
GRU update (Gated GNN, Li et al. 2016):

$$h_v^{(t+1)} = \text{GRU}\big(h_v^{(t)}, m_v^{(t+1)}\big)$$
The GRU's gating mechanism adaptively controls how much of the current representation to retain vs. replace with the new message. More expressive than linear update but more parameters.
Residual update:

$$h_v^{(t+1)} = h_v^{(t)} + \sigma\big(W^{(t)} m_v^{(t+1)}\big)$$
Skip connections (as in ResNet) preserve gradient flow in deep GNNs and mitigate over-smoothing. Used in GCNII (Chen et al., 2020) for training 64-layer GCNs successfully.
3.5 Readout for Graph-Level Tasks
For graph classification and regression, the node representations must be aggregated into a single graph embedding $h_G$.

Global pooling:

$$h_G = \text{READOUT}\big(\{\, h_v^{(T)} : v \in V \,\}\big)$$

Simple choices: sum, mean, and max (element-wise). Sum is most expressive; mean normalizes for graph size.
Hierarchical pooling. Rather than a single pooling step at the end, hierarchical GNNs (DiffPool, SAGPool) alternate GNN layers with coarsening steps that reduce the number of nodes. This mirrors how image CNNs alternate convolution with spatial downsampling.
Jumping Knowledge (Xu et al., 2018). Rather than using only the final layer's representations, JK-networks concatenate representations from all layers:

$$h_v^{\text{final}} = \big[h_v^{(1)} \,\|\, h_v^{(2)} \,\|\, \cdots \,\|\, h_v^{(T)}\big]$$
This allows the readout to leverage both local (shallow layers) and global (deep layers) structural information, and empirically mitigates over-smoothing.
Set2Set (Vinyals et al., 2016). A more powerful readout using an LSTM-based attention over the node set, producing an order-invariant graph embedding that depends on all node representations through multiple rounds of attention. More expressive than simple global pooling, at the cost of sequential processing steps.
3.6 MPNN Instances: Unifying GCN, GraphSAGE, GAT, GIN
The following table shows how each major GNN architecture is a special case of the MPNN framework:
MPNN UNIFICATION TABLE
========================================================================
Model Message M(h_v, h_u, e_uv) Aggregation Update U(h_v, m_v)
---------------------------------------------------------------------
GCN W*h_u Mean (norm.) \sigma(W*m_v)
GraphSAGE W_1*h_u Mean/Max/Sum \sigma(W*[h_v || m_v])
GAT \alpha_uv * W*h_u Attention \sigma(W*m_v)
(\alpha_uv learned from h_v, h_u)
GIN h_u Sum MLP((1+\epsilon)*h_v + m_v)
MPNN-Gilmer f_e(e_uv) * h_u Sum GRU(h_v, m_v)
---------------------------------------------------------------------
Key insight: choice of M, aggregation, U determines expressiveness
and scalability. Sum agg. + injective MLP = most expressive (Section 7)
========================================================================
4. Graph Convolutional Networks (GCN)
4.1 Spectral Foundation Recap
Recall from 11-04 Section 7.5: The GCN layer is derived as a first-order Chebyshev polynomial approximation to spectral graph convolution. Starting from the first-order filter $g_\theta \star x \approx \theta_0 x + \theta_1 (L - I) x$ applied to the normalized Laplacian (with eigenvalues in $[0, 2]$), constraining $\theta = \theta_0 = -\theta_1$ for a single parameter per layer, and applying the renormalization trick $\tilde{A} = A + I$, $\tilde{D} = D + I$, yields the GCN propagation rule. Full derivation: 11-04 Spectral Graph Theory Section 7.5.
The key takeaway is that the GCN layer is a principled spectral filter, not an ad hoc heuristic. The choices of $\tilde{A} = A + I$ (adding self-loops) and $\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ (symmetric normalization) both arise from the spectral derivation and can now be given spatial interpretations.
4.2 The GCN Layer
The GCN propagation rule for all nodes simultaneously (matrix form):

$$H^{(l+1)} = \sigma\big(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\big)$$

where:
- $\tilde{A} = A + I$ - adjacency matrix with added self-loops
- $\tilde{D}$ with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$ - degree matrix of $\tilde{A}$
- $H^{(l)} \in \mathbb{R}^{n \times d_l}$ - node feature matrix at layer $l$; $H^{(0)} = X$
- $W^{(l)} \in \mathbb{R}^{d_l \times d_{l+1}}$ - trainable weight matrix at layer $l$
- $\sigma$ - nonlinear activation (ReLU for hidden layers, softmax for output in classification)
Node-level form. For a single node $v$:

$$h_v^{(l+1)} = \sigma\Big( \sum_{u \in \mathcal{N}(v) \cup \{v\}} \frac{1}{\sqrt{\tilde{d}_v \tilde{d}_u}}\, W^{(l)\top} h_u^{(l)} \Big)$$

where $\tilde{d}_v = 1 + \deg(v)$ (degree including self-loop). Each neighbor $u$'s contribution is weighted by $1 / \sqrt{\tilde{d}_v \tilde{d}_u}$ - the geometric mean of the (augmented) degrees. This normalization prevents high-degree nodes from dominating and ensures the propagation matrix has spectral radius $\leq 1$.
The propagation matrix. Define $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$. This is the normalized adjacency of $\tilde{A}$. Its eigenvalues lie in $(-1, 1]$ (since $\hat{A}$ is symmetric and $\tilde{A}$ is non-negative). A key property: $(\hat{A}^k)_{vu} \neq 0$ only if $u$ lies within $k$ hops of $v$.

After $k$ GCN layers, node $v$'s representation is a learned function of the (weighted) $k$-hop neighborhood - exactly the receptive field intuition from Section 1.2.
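A dense-matrix sketch of the layer, adequate for small graphs (a production implementation would use sparse operations; the function name is ours):

```python
import torch

def gcn_layer(A, H, W):
    """One GCN propagation step: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W).
    A is [n, n], H is [n, d_in], W is [d_in, d_out]."""
    A_tilde = A + torch.eye(A.size(0))            # add self-loops
    d = A_tilde.sum(dim=1)                        # augmented degrees
    D_inv_sqrt = torch.diag(d.pow(-0.5))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # normalized propagation matrix
    return torch.relu(A_hat @ H @ W)
```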
4.3 Matrix View: Propagation Rule
The GCN can be understood as a learned diffusion process. Define $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ as the diffusion operator (also called the normalized propagation matrix). Then one layer $\sigma(\hat{A} H W)$ decomposes as:
- $\hat{A} H$ - neighborhood aggregation: each node's features become a weighted average of its neighbors' features. This is a graph smoothing step - it reduces variation across connected nodes.
- $(\hat{A} H)\, W$ - feature transformation: a learnable linear map on the $d$-dimensional feature space.
- $\sigma(\cdot)$ - nonlinearity: introduces expressive power.
Repeating this $K$ times gives $H^{(K)} = \sigma\big(\hat{A}\, \sigma\big(\cdots \sigma\big(\hat{A} X W^{(0)}\big) \cdots\big)\, W^{(K-1)}\big)$.
Connection to heat diffusion. The continuous heat equation on a graph is $\frac{dx}{dt} = -L x$, with solution $x(t) = e^{-tL} x(0)$. The GCN step approximates one discrete step of this diffusion. Deep GCNs approximate long-time diffusion, which - as we will see in Section 8 - drives all node representations toward the same value (over-smoothing).
4.4 Semi-Supervised Node Classification
The original application of GCN (Kipf & Welling, 2017) is semi-supervised node classification: given a graph with $n$ nodes, a small labeled subset $\mathcal{V}_L$ and the rest unlabeled, train the GCN to predict class labels for all nodes.
Setup:
- Two-layer GCN (see the code sketch after this list): $Z = \text{softmax}\big(\hat{A}\, \text{ReLU}\big(\hat{A} X W^{(0)}\big)\, W^{(1)}\big)$
- Loss: cross-entropy over labeled nodes only: $\mathcal{L} = -\sum_{v \in \mathcal{V}_L} \sum_{c} Y_{vc} \ln Z_{vc}$
- The GCN propagates label information from labeled to unlabeled nodes through the graph structure (similar to label propagation, but learned)
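A minimal sketch of this setup - a two-layer GCN whose loss is masked to the labeled nodes (sizes in the comments are Cora-scale and illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCN(nn.Module):
    """Two-layer GCN for semi-supervised node classification (a sketch; names are ours)."""
    def __init__(self, d_in, d_hidden, n_classes):
        super().__init__()
        self.W0 = nn.Linear(d_in, d_hidden, bias=False)
        self.W1 = nn.Linear(d_hidden, n_classes, bias=False)

    def forward(self, A_hat, X):
        H = torch.relu(A_hat @ self.W0(X))   # layer 1: smooth, transform, ReLU
        return A_hat @ self.W1(H)            # layer 2: logits (softmax lives in the loss)

# Masked training step -- the loss sees only the labeled nodes:
# model = GCN(d_in=1433, d_hidden=16, n_classes=7)   # Cora-sized, illustrative
# logits = model(A_hat, X)
# loss = F.cross_entropy(logits[labeled_mask], y[labeled_mask])
```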
Benchmark results (Cora citation network, 140 labeled / 2708 total):
| Method | Accuracy |
|---|---|
| DeepWalk + SVM | 67.2% |
| Label Propagation | 68.0% |
| Planetoid (Yang et al. 2016) | 75.7% |
| GCN (Kipf & Welling, 2017) | 81.5% |
| GAT (Velickovic et al., 2018) | 83.0% |
| GCNII (Chen et al., 2020, 64 layers) | 85.5% |
The GCN's success on this benchmark established GNNs as the go-to method for graph-structured learning and spawned the entire field.
Why it works. The graph encodes a "homophily" assumption: connected nodes tend to have the same class. In citation networks, papers citing each other tend to be in the same research area. The GCN leverages this assumption by smoothing features across edges, effectively propagating class information from labeled nodes to their unlabeled neighbors.
4.5 Strengths and Weaknesses of GCN
Strengths:
- Simple to implement (two matrix multiplications per layer)
- Efficient: $O(|E| \cdot d)$ time per layer for sparse graphs
- Well-understood theoretically: spectral derivation, convergence analysis
- Strong baselines for semi-supervised node classification
- Easily extended with residual connections, BatchNorm, dropout
Weaknesses:
- Transductive: the normalized adjacency $\hat{A}$ is computed for the entire graph at training time. Adding a new node requires recomputing $\hat{A}$ and retraining - the GCN cannot generalize to unseen nodes
- Fixed symmetric aggregation: every neighbor is weighted by the same geometric mean of degrees. High-degree hub nodes get down-weighted regardless of their relevance
- No edge features: the standard GCN layer has no mechanism to incorporate edge attributes
- Expressiveness limited to 1-WL: cannot distinguish non-isomorphic graphs that the WL test cannot distinguish (Section 7)
- Over-smoothing at depth: after many layers, all node representations converge (Section 8)
5. GraphSAGE: Inductive Learning on Large Graphs
5.1 The Inductive Learning Problem
GCN is transductive: it operates on a fixed graph seen during training. The normalized adjacency $\hat{A}$ encodes the training graph's structure. When new nodes arrive (a new user on a social platform, a new molecule to screen), the GCN cannot produce representations without recomputing the full graph and rerunning the entire network.
This is impractical for large, dynamic graphs. Pinterest has billions of users; new pins are added every second. Training a fresh GCN daily is infeasible.
Inductive learning solves this: learn an aggregation function that can be applied to the neighborhood of any node, including nodes unseen during training. If the function is parameterized correctly, it generalizes: apply it to the neighborhood of a new node and get a useful embedding without retraining.
GraphSAGE (Hamilton, Ying & Leskovec, 2017) - SAmple and aggreGatE - is the canonical inductive GNN framework.
5.2 GraphSAGE Framework
Key idea: instead of using the entire neighborhood of a node (which can be millions of neighbors for hub nodes), sample a fixed-size subset at each layer. Then aggregate only over the sampled set. The parameters of the aggregation function are shared across all nodes and all graphs.
Algorithm (2-layer GraphSAGE):
For each node $v$ in the target set (e.g., a mini-batch of nodes):
- Sample neighbors: $S_1(v) \subseteq \mathcal{N}(v)$ - sample $s_1$ neighbors uniformly
- For each sampled neighbor $u \in S_1(v)$, sample $s_2$ of their neighbors: $S_2(u) \subseteq \mathcal{N}(u)$
- Compute depth-2 embeddings bottom-up:
  - $h_u^{(1)} = \sigma\big(W^{(1)} \big[x_u \,\|\, \text{AGG}\big(\{x_w : w \in S_2(u)\}\big)\big]\big)$ for all $u \in S_1(v)$
  - $h_v^{(2)} = \sigma\big(W^{(2)} \big[h_v^{(1)} \,\|\, \text{AGG}\big(\{h_u^{(1)} : u \in S_1(v)\}\big)\big]\big)$
- Normalize: $h_v \leftarrow h_v / \|h_v\|_2$
Inductive inference: for a new node with known features and edges (to training-time nodes), apply the same aggregation functions with the learned weights $W^{(1)}, W^{(2)}$ - no retraining needed.
Sampling depth. A $K$-layer GraphSAGE with neighborhood sample sizes $s_1, \dots, s_K$ touches at most $\prod_i s_i$ nodes per target node. With $K = 2$, $s_1 = 25$, $s_2 = 10$, at most 250 nodes are accessed per target node, regardless of graph size. This makes training on massive graphs feasible.
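A minimal sample-and-aggregate sketch with the mean aggregator, storing the graph as Python adjacency lists (names are ours):

```python
import random
import torch
import torch.nn as nn

def sample_neighbors(adj_list, v, k):
    """Uniformly sample k neighbors of v (with replacement; falls back to v itself)."""
    nbrs = adj_list[v]
    return random.choices(nbrs, k=k) if nbrs else [v]

class SageMeanLayer(nn.Module):
    """GraphSAGE mean-aggregator layer: h_v = ReLU(W [h_v || mean(h_u)])."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W = nn.Linear(2 * d_in, d_out)

    def forward(self, h, adj_list, nodes, num_samples):
        out = []
        for v in nodes:
            nbrs = sample_neighbors(adj_list, v, num_samples)
            agg = h[nbrs].mean(dim=0)          # mean over the sampled neighborhood
            out.append(torch.relu(self.W(torch.cat([h[v], agg]))))
        return torch.stack(out)
```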
5.3 Aggregation Strategies
GraphSAGE defines three aggregation variants with different expressiveness-efficiency tradeoffs:
Mean aggregator:

$$h_v^{(k)} = \sigma\Big(W \Big[h_v^{(k-1)} \,\Big\|\, \frac{1}{|S(v)|} \sum_{u \in S(v)} h_u^{(k-1)}\Big]\Big)$$
Similar to GCN but with uniform weights (no degree normalization). Concatenation rather than addition preserves the distinction between self and neighborhood.
LSTM aggregator:

$$\text{AGG} = \text{LSTM}\big(h_{u_{\pi(1)}}, \dots, h_{u_{\pi(s)}}\big) \quad \text{for a random permutation } \pi \text{ of } S(v)$$
Apply an LSTM over a random permutation of the sampled neighbors. The LSTM can model complex interactions between neighbors (e.g., in a sequence-like order), but the random permutation breaks the theoretical permutation invariance. In practice, the LSTM often performs best on benchmark tasks due to its representational power.
Max-pool aggregator:

$$\text{AGG} = \max_{u \in S(v)} \sigma\big(W_{\text{pool}}\, h_u + b\big)$$
Apply a learnable transformation to each neighbor's representation, then take the element-wise max. Detects the presence of specific feature patterns in the neighborhood. Often the best choice for tasks where the presence of a rare feature type is informative.
Choosing an aggregator. Empirically: LSTM aggregator performs best on benchmark accuracy but is not permutation invariant. Mean aggregator is permutation invariant and performs well on inductive tasks. Max-pool aggregator is best at detecting rare structural patterns. Practitioners typically tune on validation data.
5.4 Unsupervised Training and Link Prediction
GraphSAGE can be trained without labels using a link prediction objective:

$$\mathcal{L} = -\log \sigma\big(z_u^\top z_v\big) \;-\; Q \cdot \mathbb{E}_{w \sim P_n}\big[\log \sigma\big(-z_u^\top z_w\big)\big]$$

where $(u, v)$ is a positive pair (connected in the graph), $w$ is a negative sample (a random non-neighbor), $Q$ is the number of negative samples, $P_n$ is the negative sampling distribution (typically uniform), and $\sigma$ is the sigmoid function.
Intuition: nodes that co-occur in short random walks should have similar embeddings; nodes that are far apart should have dissimilar embeddings. This is the DeepWalk objective extended to learned GNN embeddings.
For AI: unsupervised GraphSAGE embeddings are used in PinSage (Pinterest's recommendation system) to generate embeddings for pins (images with descriptions) in a bipartite pin-board graph. These embeddings power Pinterest's "more like this" recommendations over a catalog of 3+ billion pins and boards.
5.5 Scaling to Billion-Node Graphs
PinSage (Ying et al., 2018) adapted GraphSAGE to Pinterest's graph with 3 billion nodes and 18 billion edges - the largest graph neural network deployment at the time.
Key engineering innovations:
- Importance-based sampling: instead of uniform random sampling, sample neighbors proportional to their random walk visit frequency from the target node. This gives more weight to structurally important neighbors and reduces variance.
- Producer-consumer minibatch pipeline: precompute the node neighborhoods offline in Spark, stream them to GPU workers as training data. Decouples graph traversal (CPU-bound) from gradient computation (GPU-bound).
- Hard negative mining: the standard negative sampling produces easy negatives (random pins). Hard negatives are pins that are visually similar but semantically different - they force the model to learn finer-grained distinctions.
- Curriculum learning: train first with easy negatives, then add hard negatives progressively.
Deployment results: 150% improvement in head-to-head pin recommendation quality over previous collaborative filtering system, with recommendations updated daily across 3+ billion items.
6. Graph Attention Networks (GAT)
6.1 The Attention Mechanism on Graphs
GCN's aggregation weights are fixed functions of node degrees - they depend only on the graph structure, not on node features. Two neighbors with identical degree contribute identically regardless of their relevance to the central node's representation.
Graph Attention Networks (Velickovic et al., 2018) replace fixed weights with learned, data-dependent attention coefficients. For each edge , the attention coefficient is computed as a function of both nodes' features, allowing the network to focus on the most relevant neighbors.
This mirrors the transformer's self-attention, but with attention constrained to the graph's edge structure: node $v$ only attends to its actual neighbors $\mathcal{N}(v)$, not to all nodes (which would cost $O(n^2)$). The graph acts as a structural prior on the attention pattern.
6.2 GAT Layer: Attention Coefficients
Step 1: Linear transformation. Apply a shared weight matrix $W \in \mathbb{R}^{d' \times d}$ to transform all node features: $z_v = W h_v$.

Step 2: Attention scores. For each edge $(v, u)$ (including self-loops), compute:

$$e_{vu} = \text{LeakyReLU}\big(a^\top [W h_v \,\|\, W h_u]\big)$$

where $a \in \mathbb{R}^{2d'}$ is a learnable attention vector, and $\|$ denotes concatenation. The raw attention score $e_{vu}$ measures the "relevance" of node $u$ to node $v$. LeakyReLU (with negative slope $0.2$) allows gradient flow for negative scores.

Step 3: Normalize. Apply softmax over the neighborhood to get normalized coefficients:

$$\alpha_{vu} = \frac{\exp(e_{vu})}{\sum_{w \in \mathcal{N}(v) \cup \{v\}} \exp(e_{vw})}$$

Note: $\alpha_{vu} \geq 0$ and $\sum_{u \in \mathcal{N}(v) \cup \{v\}} \alpha_{vu} = 1$.

Step 4: Aggregation. Compute the updated representation:

$$h_v' = \sigma\Big( \sum_{u \in \mathcal{N}(v) \cup \{v\}} \alpha_{vu}\, W h_u \Big)$$

MPNN view: the message is $\alpha_{vu} W h_u$ and the aggregation is sum.
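Putting the four steps together over an edge list - a single-head sketch that assumes self-loops are already present in edge_index (all names are ours):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head GAT layer over an edge list (self-loops assumed included)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.a = nn.Parameter(torch.randn(2 * d_out) * 0.1)   # attention vector

    def forward(self, h, edge_index):
        src, dst = edge_index                  # messages flow src -> dst
        z = self.W(h)
        # raw scores: LeakyReLU(a^T [z_v || z_u]) per edge
        e = F.leaky_relu((torch.cat([z[dst], z[src]], dim=-1) * self.a).sum(-1), 0.2)
        e = e - e.max()                        # numerical stability before exp
        num = torch.exp(e)
        denom = torch.zeros(h.size(0)).index_add_(0, dst, num)
        alpha = num / denom[dst]               # softmax within each neighborhood
        out = torch.zeros(h.size(0), z.size(1)).index_add_(0, dst, alpha.unsqueeze(-1) * z[src])
        return F.elu(out)
```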
6.3 Multi-Head Attention for Graphs
Single-head attention can be unstable and may attend to a limited range of features. Multi-head attention stabilizes training and allows the model to jointly attend to different aspects of neighbors.
Multi-head GAT layer. Run $K$ independent attention mechanisms in parallel, each with its own parameters $(W^{(k)}, a^{(k)})$:

$$h_v' = \Big\Vert_{k=1}^{K} \sigma\Big( \sum_{u \in \mathcal{N}(v) \cup \{v\}} \alpha_{vu}^{(k)}\, W^{(k)} h_u \Big)$$

where $\Vert$ denotes concatenation. The output dimension is $K \cdot d'$.
For the final layer (before a classification head), averaging is often preferred over concatenation:

$$h_v' = \sigma\Big( \frac{1}{K} \sum_{k=1}^{K} \sum_{u \in \mathcal{N}(v) \cup \{v\}} \alpha_{vu}^{(k)}\, W^{(k)} h_u \Big)$$
Complexity. A $K$-head GAT layer costs $O(K m d')$ for attention computation ($m$ edges, $d'$ features per head) and $O(K n d d')$ for the linear transformations. For dense attention (a fully-connected graph, $m = n^2$), this is $O(K n^2 d')$ - the same scaling as transformer attention.
6.4 GATv2: Fixing the Static Attention Problem
Brody, Alon & Yahav (2022) identified a fundamental limitation of the original GAT: its attention is static - the ranking of a neighbor's importance does not depend on the query node.
The problem. In GAT, the attention score is:

$$e_{vu} = \text{LeakyReLU}\big(a_1^\top W h_v + a_2^\top W h_u\big)$$

writing $a = [a_1 \| a_2]$. Up to the (monotone) LeakyReLU, this is a sum of two terms - one depending only on $v$, one depending only on $u$ - so the ranking of neighbors $u$ vs $u'$ by their scores $e_{vu}$ vs $e_{vu'}$ is independent of $v$. The "most important neighbor" is the same regardless of which node is doing the asking. This is "static attention."

Dynamic attention (GATv2). Fix by swapping the order of the linear transformation and concatenation:

$$e_{vu} = a^\top \text{LeakyReLU}\big(W [h_v \,\|\, h_u]\big)$$

Now the LeakyReLU nonlinearity is applied before the dot product with $a$, creating genuine interaction between $h_v$ and $h_u$. The ranking of $u$ vs $u'$ can now depend on $v$ - the attention is dynamic.
Formal claim (Brody et al., 2022): GAT is a strictly less powerful attention function than GATv2. There exist graphs where GAT's attention collapses to uniform weighting while GATv2's attention correctly assigns non-uniform importance. GATv2 matches or exceeds GAT on all standard benchmarks.
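The difference is essentially a reordering of the same operations. A small sketch contrasting the two score functions (dimensions and parameter names are ours):

```python
import torch
import torch.nn.functional as F

d, dp = 4, 8
W1, a1 = torch.randn(dp, d), torch.randn(2 * dp)    # GAT parameters
W2, a2 = torch.randn(dp, 2 * d), torch.randn(dp)    # GATv2 parameters

def score_gat(h_v, h_u):
    # LeakyReLU(a^T [W h_v || W h_u]): a monotone function of a sum of
    # per-node terms, so the neighbor ranking is the same for every query v.
    return F.leaky_relu(a1 @ torch.cat([W1 @ h_v, W1 @ h_u]), 0.2)

def score_gatv2(h_v, h_u):
    # a^T LeakyReLU(W [h_v || h_u]): the nonlinearity precedes the dot
    # product, so the ranking of neighbors can depend on the query v.
    return a2 @ F.leaky_relu(W2 @ torch.cat([h_v, h_u]), 0.2)
```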
6.5 Attention Sparsity and Interpretability
A commonly cited advantage of GAT over GCN is interpretability: the attention coefficients can be read as edge importance scores. Edges with high are "important" for node 's representation; edges with near-zero are "ignored."
Limitations of this interpretation. Jain & Wallace (2019) and Wiegreffe & Pinter (2019) showed for NLP models that attention weights do not reliably indicate feature importance. Similar caveats apply to GAT:
- Attention weights are post-softmax and sum to 1; the same weight value means something very different over 3 neighbors than over 300.
- The softmax creates competition between neighbors - a "dominant" neighbor may capture high attention simply by having a large dot product with $a$, not because it is semantically important.
- Different attention heads may attend to completely different structural patterns, and the meaning of each head is not determined a priori.
Despite these caveats, in practice, GAT attention patterns do reveal meaningful structural patterns for molecular and knowledge graph tasks, especially when validated against domain knowledge.
7. Expressiveness and the Weisfeiler-Leman Test
7.1 The Graph Isomorphism Problem
Definition (Graph Isomorphism). Two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ are isomorphic (written $G_1 \cong G_2$) if there exists a bijection $\pi : V_1 \to V_2$ such that $(u, v) \in E_1 \iff (\pi(u), \pi(v)) \in E_2$.
Two isomorphic graphs are structurally identical - they differ only in how the vertices are labeled. For any function of graph structure (predicting molecular properties, classifying graph type), the function must produce the same output on isomorphic graphs. This is exactly the permutation invariance requirement from 2.3.
The graph isomorphism (GI) problem - determining whether two given graphs are isomorphic - is one of the few natural problems not known to be in P or NP-complete. For practical purposes (GNN expressiveness), we ask a weaker question: can a GNN distinguish two non-isomorphic graphs by assigning them different representations?
Non-isomorphic graphs that look similar. Consider:
Graph 1: Two disjoint triangles (C_3 \cup C_3)
o-o o-o
\ / \ /
o o
Graph 2: One 6-cycle (C_6)
o-o-o
| |
o-o-o
Both graphs have 6 nodes, each of degree 2 - the same degree sequence. Are they isomorphic? No: $G_1$ has two triangles; $G_2$ has no triangles. A GNN should assign them different representations. Can it?
7.2 1-WL and Color Refinement
The Weisfeiler-Leman (WL) graph isomorphism test (Weisfeiler & Leman, 1968) is an efficient algorithm for testing graph isomorphism. While not complete (there exist non-isomorphic pairs it cannot distinguish), it correctly identifies isomorphism for most practical graphs.
1-WL Color Refinement Algorithm:
Initialize: assign all nodes the same color, $c_v^{(0)} = 1$ (or, for node-attributed graphs, $c_v^{(0)} = \text{HASH}(x_v)$).

Iterate for $t = 1, 2, \dots$ until convergence:

$$c_v^{(t)} = \text{HASH}\Big(c_v^{(t-1)},\; \{\!\!\{\, c_u^{(t-1)} : u \in \mathcal{N}(v) \,\}\!\!\}\Big)$$

where $\{\!\!\{\cdot\}\!\!\}$ denotes a multiset (duplicate elements are preserved) and HASH is an injective hash function on (color, multiset of colors) pairs.

Decision: If at any iteration, graphs $G_1$ and $G_2$ have different multisets of node colors, declare them non-isomorphic. If the algorithm stabilizes with the same color multisets, declare them (possibly) isomorphic.

Convergence: The algorithm stabilizes in at most $n$ iterations (when no color class splits further). A naive implementation runs in $O(n(n + m))$ time; careful bookkeeping improves this to $O((n + m) \log n)$.
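The refinement loop is short enough to implement directly. A sketch that refines colors on the disjoint union of two graphs so their color histograms are comparable (function and variable names are ours):

```python
from collections import Counter

def wl_indistinguishable(adj1, adj2):
    """1-WL color refinement on the disjoint union of two graphs;
    returns True if the final color histograms of the two halves match."""
    n1 = len(adj1)
    adj = {v: list(nbrs) for v, nbrs in adj1.items()}
    adj.update({v + n1: [u + n1 for u in nbrs] for v, nbrs in adj2.items()})

    colors = {v: 0 for v in adj}                  # uniform initial color
    num_classes = 1
    while True:
        # signature = (own color, sorted multiset of neighbor colors)
        sig = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v]))) for v in adj}
        palette = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        colors = {v: palette[sig[v]] for v in adj}
        if len(palette) == num_classes:           # no class split: stable
            break
        num_classes = len(palette)

    hist1 = Counter(colors[v] for v in adj1)
    hist2 = Counter(colors[v + n1] for v in adj2)
    return hist1 == hist2

# Two disjoint triangles vs one 6-cycle: 1-WL cannot tell them apart.
tri2 = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
c6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(wl_indistinguishable(tri2, c6))  # True
```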
Example. For $C_3 \cup C_3$ vs $C_6$ with no node attributes:
- $t = 0$: all nodes have color $1$ - same for both graphs
- $t = 1$: $c_v^{(1)} = \text{HASH}\big(1, \{\!\!\{1, 1\}\!\!\}\big)$, the same for all nodes (every node has degree 2 with same-colored neighbors) - still identical across both graphs!
- Algorithm stabilizes: 1-WL cannot distinguish $C_3 \cup C_3$ from $C_6$ - in both, every node looks like "degree 2 with two identically colored neighbors."
This is a fundamental limitation: 1-WL cannot detect triangles (3-cycles), so neither can any MPNN (Theorem below).
7.3 The GNN Expressiveness Theorem
Theorem (Xu et al., 2019). Let $f$ be any MPNN with a fixed number of layers $T$ and a countable feature space. Then:
- If $f$ assigns different representations to graphs $G_1, G_2$, then 1-WL also distinguishes $G_1$ and $G_2$ (within $T$ iterations).
- For any pair that 1-WL distinguishes, there exists an MPNN that also distinguishes them.
Corollary. The discriminative power of any MPNN is bounded above by 1-WL. Conversely, the most expressive MPNN achieves exactly the discriminative power of 1-WL.
Proof sketch (upper bound). At each MPNN layer, node $v$'s representation is a function of $\big(h_v^{(t)}, \{\!\!\{\, h_u^{(t)} : u \in \mathcal{N}(v) \,\}\!\!\}\big)$. This is precisely the information 1-WL hashes. If the aggregation + update function is injective (different inputs -> different outputs), the MPNN exactly simulates 1-WL. If it is not injective (e.g., mean aggregation), it may collapse distinct multisets to the same representation, making it strictly weaker than 1-WL.
Implications:
- MPNNs with mean or max aggregation are strictly weaker than 1-WL - they collapse some distinct multisets
- MPNNs with sum aggregation and injective MLP update are exactly 1-WL - GIN achieves this
- No MPNN (regardless of architecture) can distinguish graphs that 1-WL cannot distinguish (e.g., $C_3 \cup C_3$ vs $C_6$, or regular graphs of the same degree and size)
- Expressiveness limitations are fundamental, not a matter of training or capacity
7.4 GIN: The Most Expressive 1-WL GNN
Graph Isomorphism Network (Xu et al., 2019). To achieve 1-WL expressiveness, the aggregation function must be injective on multisets. Xu et al. prove:
Lemma. Any injective function $g$ on multisets $X$ over a countable universe can be expressed as:

$$g(X) = \phi\Big(\sum_{x \in X} \varphi(x)\Big)$$

for some functions $\phi, \varphi$. In other words, sum aggregation followed by an MLP is a universal approximator for injective multiset functions.
GIN layer:

$$h_v^{(t+1)} = \text{MLP}^{(t)}\Big(\big(1 + \epsilon^{(t)}\big)\, h_v^{(t)} + \sum_{u \in \mathcal{N}(v)} h_u^{(t)}\Big)$$

where $\epsilon^{(t)}$ is either learned or fixed to a small constant. The $(1 + \epsilon)$ term lets the network distinguish the central node's own features from the neighbor sum (otherwise a node's representation could be confused with that of a differently structured neighborhood whose features happen to sum to the same value).
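A sketch of the layer over an edge list (a minimal instance of the formula above; names are ours):

```python
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    """GIN update: h_v <- MLP((1 + eps) * h_v + sum of neighbor features)."""
    def __init__(self, dim, eps=0.0, learn_eps=True):
        super().__init__()
        self.eps = nn.Parameter(torch.tensor(eps)) if learn_eps else eps
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, h, edge_index):
        src, dst = edge_index
        agg = torch.zeros_like(h).index_add_(0, dst, h[src])  # injective sum aggregation
        return self.mlp((1 + self.eps) * h + agg)
```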
Why mean and max fail:
| Aggregation | Counterexample multisets | Behavior |
|---|---|---|
| Mean | $\{1, 1\}$ vs $\{1, 1, 1\}$ | Both give mean $= 1$ - indistinguishable |
| Max | $\{1, 2\}$ vs $\{1, 2, 2\}$ | Both give max $= 2$ - indistinguishable |
| Sum | $\{1, 1\}$ vs $\{1, 1, 1\}$ | Sums $2$ vs $3$ - distinguishable |
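These counterexamples can be checked in a few lines (illustrative):

```python
import torch

# The counterexample multisets from the table, as 1-D feature tensors.
a, b = torch.tensor([1., 1.]), torch.tensor([1., 1., 1.])   # {1,1} vs {1,1,1}
c, d = torch.tensor([1., 2.]), torch.tensor([1., 2., 2.])   # {1,2} vs {1,2,2}

print(a.mean() == b.mean())  # True  -> mean cannot separate {1,1} from {1,1,1}
print(c.max() == d.max())    # True  -> max cannot separate {1,2} from {1,2,2}
print(a.sum() == b.sum())    # False -> sum distinguishes {1,1} from {1,1,1}
```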
GIN for graph classification. Use sum readout at each layer and concatenate across layers:

$$h_G = \Big\Vert_{t=0}^{T} \sum_{v \in V} h_v^{(t)}$$
This jumping-knowledge-style readout captures structural patterns at multiple scales.
7.5 Beyond 1-WL: Higher-Order GNNs
1-WL's limitations (e.g., failing to count triangles, failing on regular graphs) motivate higher-order extensions.
$k$-WL test. Instead of coloring individual nodes, color $k$-tuples of nodes $(v_1, \dots, v_k) \in V^k$. The refinement rule considers the colors of all $k$-tuples that differ in exactly one position. $(k+1)$-WL is strictly more powerful than $k$-WL for all $k \geq 2$.
$k$-GNN (Morris et al., 2019). Implement $k$-WL as a GNN by passing messages between $k$-tuples. The cost is $O(n^k)$ tuples - prohibitive for $k \geq 3$ on large graphs.
NGNN and subgraph GNNs. A more practical approach: instead of node-level MPNNs, run MPNNs on subgraphs induced by $k$-hop neighborhoods, ego-graphs, or sampled subgraphs. These can detect cycles, cliques, and other motifs that 1-WL misses. Examples: NGNN (Zhang & Li, 2021), OSAN (Zhao et al., 2022).
Practical trade-off:
| Method | Expressiveness | Complexity | Practical use |
|---|---|---|---|
| 1-WL MPNN (GIN) | 1-WL | $O(m)$ per layer | Standard; most applications |
| Subgraph GNNs | $>$ 1-WL | one MPNN per subgraph, roughly $O(n \cdot m)$ | Medium graphs, chemistry |
| $k$-GNN ($k = 3$) | 3-WL | $O(n^3)$ | Small graphs only |
| Random features | Universal | $O(m)$ per layer | Simple, effective in practice |
7.6 Structural Features and Distance Encoding
A practical alternative to higher-order GNNs: augment node features with handcrafted structural descriptors that help the GNN detect patterns it otherwise cannot.
Random Walk Structural Encoding (RWSE). For each node $v$, compute the return probability of a random walk of length $k$ that starts and ends at $v$:

$$p_v^{(k)} = \big(P^k\big)_{vv}, \qquad P = D^{-1} A$$

The vector $\big(p_v^{(1)}, p_v^{(2)}, \dots, p_v^{(K)}\big)$ encodes the local loop structure (triangles, 4-cycles, etc.). Used in GPS (Rampasek et al., 2022) and Graphormer.
Laplacian Positional Encoding (LapPE). Use the first $k$ nontrivial eigenvectors of the normalized Laplacian as node features: $\text{PE}(v) = \big(u_1(v), \dots, u_k(v)\big)$. (Reviewed in 11-04 Section 9.5 - see there for the sign-ambiguity issue and its fix.)
Degree features. Simply appending the node's degree as a feature allows the GNN to distinguish nodes of different degree even when 1-WL would assign them the same color (since 1-WL with uniform initial colors cannot use degree beyond what the refinement computes). More sophisticated: distance encoding (Li et al., 2020) appends the shortest-path distances from a node to a set of anchor nodes.
8. Over-Smoothing, Over-Squashing, and Depth
8.1 Over-Smoothing: Formal Analysis
A 2-layer GCN typically outperforms a 6-layer GCN on standard node classification benchmarks. This seems paradoxical: more layers should give access to more of the graph. The reason is over-smoothing: as depth increases, all node representations converge to the same vector, destroying the discriminative information needed for classification.
Formal statement (Li et al., 2018). Let $P = \tilde{D}^{-1} \tilde{A}$ be the random-walk propagation matrix (used here for simplicity; the symmetric $\hat{A}$ behaves analogously). Then, for a connected graph:

$$\lim_{k \to \infty} \big(P^k\big)_{uv} = \pi_v, \qquad \pi_v = \frac{\tilde{d}_v}{\sum_w \tilde{d}_w}$$

where $\pi$ is the stationary distribution of the random walk on $\tilde{A}$. That is, $k$-step propagation from any starting point converges to the stationary distribution regardless of initial conditions.
Consequence. For a linear GCN (no nonlinearities), $H^{(k)} = \hat{A}^k X W$ with $W = W^{(0)} \cdots W^{(k-1)}$. As $k \to \infty$:

$$\hat{A}^k \to \phi\, \phi^\top, \qquad \phi_v = \sqrt{\tilde{d}_v \Big/ \textstyle\sum_w \tilde{d}_w}$$

All nodes collapse onto a single direction determined only by their degrees. For connected graphs, $H^{(k)}$ converges to a rank-1 matrix - node $v$'s representation is proportional to $\sqrt{\tilde{d}_v}$ times a degree-weighted graph-wide feature average.
With nonlinearities (ReLU), convergence is not exact, but the trend holds: after 6-8 layers, node representations on typical graphs (small-world, power-law degree distribution) are nearly identical.
MADGap metric (Chen et al., 2020). Mean Average Distance (MAD) measures pairwise distances between node representations. MADGap = MAD(between-class) - MAD(within-class). For a good classifier, MADGap should be large (between-class distances large, within-class distances small). Over-smoothing drives MAD -> 0, collapsing MADGap.
8.2 Information-Theoretic View via Dirichlet Energy
The Dirichlet energy of the node representation matrix $H$ measures total variation across edges:

$$E(H) = \frac{1}{2} \sum_{(u, v) \in E} \|h_u - h_v\|_2^2 = \text{tr}\big(H^\top L H\big)$$

where $L = D - A$ is the unnormalized Laplacian.
Properties:
- $E(H) \geq 0$ for all $H$; on a connected graph, $E(H) = 0$ iff all node representations are identical
- $E(H)$ is large when adjacent nodes have different representations - high "frequency content"
- The GCN smoothing step reduces $E$: $E(\hat{A} H) \leq (1 - \lambda)^2 E(H)$ for a graph-dependent constant $\lambda \in (0, 1]$ (since $\hat{A}$ is a low-pass graph filter)

Over-smoothing = energy dissipation. Each GCN layer reduces $E$. After $k$ layers:

$$E\big(\hat{A}^k H\big) \leq (1 - \lambda)^{2k}\, E(H)$$

Since $\lambda > 0$ for the symmetric normalized propagation matrix with self-loops, the energy decays geometrically. The rate depends on $\lambda_2$ (the Fiedler value): graphs with a large spectral gap (expanders) over-smooth faster than graphs with a small spectral gap (clustered graphs).
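The geometric decay is easy to observe numerically: the sketch below repeatedly applies the GCN smoothing step to random features on a random graph and prints $E(H)$ (graph, seed, and sizes are arbitrary):

```python
import torch

def dirichlet_energy(H, L):
    """E(H) = trace(H^T L H): total feature variation across edges."""
    return torch.trace(H.T @ L @ H).item()

torch.manual_seed(0)
n, d = 50, 8
A = (torch.rand(n, n) < 0.1).float()
A = torch.triu(A, 1); A = A + A.T                 # symmetric, no self-loops
L = torch.diag(A.sum(1)) - A                      # unnormalized Laplacian

A_t = A + torch.eye(n)                            # add self-loops
d_inv_sqrt = torch.diag(A_t.sum(1).pow(-0.5))
A_hat = d_inv_sqrt @ A_t @ d_inv_sqrt             # GCN propagation matrix

H = torch.randn(n, d)
for k in range(10):
    print(k, round(dirichlet_energy(H, L), 2))    # energy shrinks geometrically
    H = A_hat @ H
```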
Practical implication. For clustered graphs (molecular graphs, citation networks with strong community structure), over-smoothing is slow - one can use deeper GNNs. For expander-like graphs (dense social networks, random graphs), over-smoothing is fast - 2-4 layers is optimal.
8.3 Over-Squashing: Bottleneck Nodes
Over-squashing is a distinct (and more subtle) problem: information from distant nodes is exponentially compressed as it flows through bottleneck edges, preventing long-range interactions from influencing predictions.
Jacobian analysis (Alon & Yahav, 2021). Consider the Jacobian of node $v$'s representation at layer $k$ with respect to node $u$'s initial feature, $\partial h_v^{(k)} / \partial x_u$. By the chain rule it is a sum of terms, one for each path from $u$ to $v$ of length at most $k$, each a product of per-layer derivatives along the path. The norm of this Jacobian is bounded by:

$$\left\| \frac{\partial h_v^{(k)}}{\partial x_u} \right\| \;\leq\; C^{\,k}\, \big(\hat{A}^k\big)_{vu}$$

where $(\hat{A}^k)_{vu}$ is the $(v, u)$ entry of the $k$-th power of the propagation matrix and $C$ bounds the Lipschitz constants of the message and update functions. For nodes far apart in the graph, $(\hat{A}^k)_{vu}$ is exponentially small in the distance $d(u, v)$.
The bottleneck. If $u$ and $v$ lie on opposite sides of a single edge of high betweenness centrality (a "bridge"), all information from $u$'s side to $v$'s side must pass through that edge. The aggregation at each bridge endpoint receives messages originating from exponentially many nodes and compresses them into a single $d$-dimensional vector - losing information at a rate that grows with the size of the receptive field being squeezed through the bridge.
Over-squashing vs over-smoothing. These are dual problems:
- Over-smoothing: too much aggregation -> representations become too similar
- Over-squashing: aggregation is a bottleneck -> distant information cannot influence predictions
- Over-smoothing is a property of the graph and its spectrum; over-squashing is a property of specific bottleneck edges
8.4 Remedies for Over-Smoothing
DropEdge (Rong et al., 2020). Randomly remove edges during training: replace $A$ with a sparsified adjacency in which each edge is dropped independently with probability $p$. This is analogous to dropout for neurons but applied to edges. It slows the convergence of over-smoothing (less propagation per layer) and acts as a data augmentation method. Improves performance at depths 4-8 on standard benchmarks.
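A minimal DropEdge sketch over an edge list (for an undirected graph stored as two directed arcs, one would drop both directions together; names are ours):

```python
import torch

def drop_edge(edge_index, p=0.2, training=True):
    """Randomly drop each edge with probability p during training (DropEdge-style)."""
    if not training or p <= 0:
        return edge_index
    keep = torch.rand(edge_index.size(1)) >= p   # Bernoulli mask over edges
    return edge_index[:, keep]
```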
PairNorm (Zhao & Akoglu, 2020). After each GCN layer, re-normalize the node representations to keep the total pairwise distance fixed: center, then rescale,

$$\tilde{h}_v = h_v - \frac{1}{n} \sum_{u} h_u, \qquad h_v^{\text{PN}} = s \cdot \frac{\tilde{h}_v}{\sqrt{\frac{1}{n} \sum_{u} \|\tilde{h}_u\|_2^2}}$$

where $s$ is a scaling hyperparameter. PairNorm prevents the Dirichlet energy from decaying to zero by keeping pairwise distances bounded below. Empirically effective for deep GCNs (8-16 layers).
Initial residual connection (GCNII, Chen et al., 2020).

$$H^{(l+1)} = \sigma\Big( \big((1 - \alpha)\, \hat{A} H^{(l)} + \alpha H^{(0)}\big)\, \big((1 - \beta_l) I + \beta_l W^{(l)}\big) \Big)$$

Two additions: (1) the initial residual $\alpha H^{(0)}$ - always mixing in the initial representation with weight $\alpha$ prevents full convergence; (2) the weight term mixes between the identity ($\beta_l = 0$, pure smoothing) and a full linear transform ($\beta_l = 1$). With $\beta_l$ decreasing with depth, GCNII successfully trains 64-layer GCNs, achieving state-of-the-art on Cora at the time (85.5% accuracy vs 81.5% for a 2-layer GCN).
GroupNorm / NodeNorm. Normalize within groups of nodes to prevent scale collapse. Less commonly used than the above.
8.5 Graph Rewiring
A more aggressive approach: change the graph structure to improve information flow before running the GNN.
Diffusion-based rewiring (DIGL, Gasteiger et al., 2019). Replace $A$ with a graph diffusion matrix: a truncated heat kernel $e^{-tL}$ or the personalized PageRank matrix $S = \alpha \big(I - (1 - \alpha)\, D^{-1/2} A D^{-1/2}\big)^{-1}$. The diffusion matrix connects distant nodes with non-zero weights, enabling long-range information flow in one GNN layer. Edges with small weights are pruned, keeping the graph sparse.
Expander Graph Propagation (EGP, Deac et al., 2022). Augment the graph with the edges of a sparse expander graph (a $d$-regular graph with large spectral gap, e.g., a Ramanujan graph). Expander edges provide "shortcuts" that reduce the effective diameter of the graph, alleviating over-squashing without the cost of full connectivity.
FoSR: First-Order Spectral Rewiring (Karhadkar et al., 2023). Add edges that most increase $\lambda_2$ (the Fiedler value), directly attacking the spectral bottleneck. Greedy algorithm: at each step, add the edge that maximally increases $\lambda_2$ of the augmented graph. As edges are added, the graph's effective diameter shrinks and over-squashing is reduced.
8.6 Depth vs Width Trade-off in GNNs
Why shallow GNNs dominate in practice. On most benchmark tasks (node classification on citation networks, graph classification on molecular benchmarks), 2-4 GNN layers achieve the best performance. The reasons:
- Homophily dominates: in social and citation networks, the most informative neighbors are at distance 1-2. Beyond distance 3, nodes are often from different classes, and aggregating them hurts classification.
- Over-smoothing: as analyzed above, deep GCNs lose discriminative information.
- Exponential neighborhood growth: a node's $k$-hop neighborhood in a power-law graph has on the order of $\bar{d}^{\,k}$ nodes for average degree $\bar{d}$. For $\bar{d} = 10$ and $k = 5$, that's 100,000 nodes - mostly irrelevant.
When depth helps. Long-range tasks where predictions depend on distant nodes:
- Predicting molecular properties that depend on the entire molecular graph (QM9 quantum chemistry dataset)
- Reasoning over knowledge graphs with long inference chains
- Planning and path-finding tasks on sparse graphs
For these tasks, graph Transformers (Section 10) - which compute full pairwise attention in $O(n^2)$ - often outperform deep GNNs.
Jumping Knowledge Networks (Xu et al., 2018). A principled approach: each node selects the representation from the most appropriate depth:

$$h_v^{\text{final}} = \text{COMBINE}\big(h_v^{(1)}, h_v^{(2)}, \dots, h_v^{(K)}\big)$$
COMBINE can be concatenation, LSTM-attention, or max-pooling across layers. This allows different nodes to "listen" at different distances - nodes in dense clusters use shallow representations; nodes that are bridges benefit from deeper representations.
9. Graph Pooling and Hierarchical GNNs
9.1 Why Pooling Is Non-Trivial on Graphs
In image CNNs, pooling is straightforward: reduce a spatial block to a single pixel by averaging. The operation is well-defined because images have a fixed grid structure.
On graphs, pooling (coarsening) must answer: which nodes get merged? How are their features combined? How is the edge structure of the coarsened graph defined? All of these must be permutation invariant - the coarsened graph cannot depend on how vertices are labeled.
The challenges:
- No canonical merging: unlike pixels in a grid, there is no obvious spatial proximity to guide merging. Spectral methods (Section 9.4) use the Fiedler vector; learned methods (Section 9.3) learn soft assignments.
- Edge reconstruction: after merging nodes $u$ and $v$ into a super-node $w$, which edges does $w$ inherit? Typically all edges incident to $u$ or $v$ - but this may create multi-edges and self-loops.
- Information loss: pooling irreversibly reduces the graph. Unlike deconvolution in image models, there is no standard graph "unpooling" that perfectly recovers the original structure.
9.2 Global Pooling Methods
For graph-level tasks, a single pooling step at the end suffices. The readout function maps the set of node representations to a single graph embedding.
Sum pooling: $h_G = \sum_{v \in V} h_v$. Sensitive to graph size (larger graphs get larger embeddings). Most expressive for graph-level tasks (by the same argument as sum aggregation for node-level).

Mean pooling: $h_G = \frac{1}{|V|} \sum_{v \in V} h_v$. Normalizes for graph size; best when comparing graphs of different sizes.

Max pooling: $h_G = \max_{v \in V} h_v$ (element-wise). Detects whether any node has a feature above a threshold; good for detecting rare structural motifs.
Attention pooling (gated global pooling):

$$h_G = \sum_{v \in V} \sigma\big(g(h_v)\big) \odot f(h_v)$$

where $g$ scores each node (a gating network producing weights) and $f$ transforms the features. Allows the model to focus on the most informative nodes. Used in Graph U-Net, Graphormer.
Set2Set (Vinyals et al., 2016). An LSTM-based readout that runs $T$ steps of attention over the node set, accumulating a context vector. More expressive than simple pooling, but costs $T$ sequential steps. The sketch below implements the simpler readouts.
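A sketch of the four simple readouts for a single graph (the `gate` and `transform` modules for attention pooling are assumed to be small learned maps, e.g., `nn.Linear(d, 1)` and `nn.Linear(d, d)`):

```python
import torch

def readout(H, mode="sum", gate=None, transform=None):
    """H: (n, d) node embeddings of one graph -> (d,) graph embedding."""
    if mode == "sum":
        return H.sum(dim=0)
    if mode == "mean":
        return H.mean(dim=0)
    if mode == "max":
        return H.max(dim=0).values
    if mode == "attention":   # gated pooling: sum_v sigmoid(g(h_v)) * f(h_v)
        scores = torch.sigmoid(gate(H))           # (n, 1) per-node gates
        return (scores * transform(H)).sum(dim=0)
    raise ValueError(f"unknown mode: {mode}")
```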
9.3 DiffPool: Differentiable Graph Pooling
DiffPool (Ying et al., 2019) learns soft cluster assignments to hierarchically coarsen the graph. At each pooling level $l$, run two GNNs in parallel:
$$S^{(l)} = \mathrm{softmax}\big(\text{GNN}_{\text{pool}}(A^{(l)}, X^{(l)})\big) \in \mathbb{R}^{n_l \times n_{l+1}}, \qquad Z^{(l)} = \text{GNN}_{\text{embed}}(A^{(l)}, X^{(l)})$$
where $n_l$ is the number of nodes at level $l$, $n_{l+1}$ is the target number of clusters, $S^{(l)}$ is the soft assignment matrix (each node assigns fractionally to each of the $n_{l+1}$ clusters), and $Z^{(l)}$ is the GNN embedding of current-level nodes.
Coarsened graph:
$$X^{(l+1)} = S^{(l)\top} Z^{(l)}, \qquad A^{(l+1)} = S^{(l)\top} A^{(l)} S^{(l)}$$
The coarsened adjacency is dense - every pair of clusters has a (soft) connection weighted by the total edge weight between them.
Auxiliary losses. DiffPool adds two regularization terms to the main task loss:
- Link prediction loss: $L_{\text{LP}} = \big\| A^{(l)} - S^{(l)} S^{(l)\top} \big\|_F$ - encourages clusters to correspond to connected subgraphs
- Entropy loss: $L_E = \frac{1}{n_l} \sum_{v} H\big(S^{(l)}_{v,:}\big)$ - minimizing per-node assignment entropy encourages crisp (non-uniform) cluster assignments
Limitation. DiffPool produces a dense $A^{(l+1)}$ at each level - $O(n_l^2)$ memory. For large graphs, this is prohibitive.
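A dense single-level sketch of the coarsening step and auxiliary losses (assuming the two GNNs have already produced the embeddings `Z` and the assignment logits):

```python
import torch

def diffpool_step(A, Z, S_logits):
    """A: (n, n) adjacency; Z: (n, d) embeddings from GNN_embed;
    S_logits: (n, k) cluster scores from GNN_pool."""
    S = torch.softmax(S_logits, dim=-1)          # soft assignments, rows sum to 1
    X_coarse = S.t() @ Z                         # (k, d) cluster features
    A_coarse = S.t() @ A @ S                     # (k, k) dense coarsened adjacency
    link_loss = torch.norm(A - S @ S.t())        # Frobenius link-prediction loss
    ent = -(S * torch.log(S + 1e-12)).sum(-1)    # per-node assignment entropy
    return X_coarse, A_coarse, link_loss, ent.mean()
```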
9.4 MinCutPool and Spectral Pooling
MinCutPool (Bianchi et al., 2020) addresses DiffPool's density problem by formulating pooling as a spectral clustering problem with a differentiable objective.
The mincut objective. For a soft cluster assignment $S \in \mathbb{R}^{n \times K}$, the normalized minimum cut loss is:
$$L_{\text{cut}} = -\frac{\mathrm{Tr}\big(S^\top A S\big)}{\mathrm{Tr}\big(S^\top D S\big)}$$
Minimizing this over soft assignments is the relaxation of spectral clustering - the optimal relaxed $S$ spans the leading eigenvectors of the normalized adjacency. MinCutPool uses a GNN to produce $S$ and trains end-to-end with the minCUT loss plus an orthogonality regularizer
$$L_o = \left\| \frac{S^\top S}{\|S^\top S\|_F} - \frac{I_K}{\sqrt{K}} \right\|_F$$
to prevent degenerate cluster assignments.
Advantage over DiffPool: the objective is theoretically motivated by spectral graph theory, and the orthogonality regularizer keeps the assignment matrix well-conditioned.
9.5 SAGPool: Self-Attention Graph Pooling
SAGPool (Lee et al., 2019) takes a different approach: instead of soft cluster assignments, select the top-$k$ nodes by a learned importance score and discard the rest.
Algorithm:
- Run one GNN layer to compute node scores: $s = \text{GNN}_{\text{score}}(A, X) \in \mathbb{R}^n$
- Select the top $\lceil kn \rceil$ nodes: $\text{idx} = \text{top-}k(s)$
- Gate the selected features: $X' = X_{\text{idx}} \odot \tanh(s_{\text{idx}})$
- Induce the subgraph: $A' = A_{\text{idx},\,\text{idx}}$ (restrict to selected nodes)
Advantages: simple, interpretable (which nodes are selected?), maintains sparsity of the adjacency. Disadvantage: hard top-$k$ selection is not differentiable (use a straight-through estimator or continuous relaxation during training).
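A minimal sketch of the selection-and-gating step; a full SAGPool computes the scores with a GNN layer, while here a fixed linear scorer (`score_weights`, a stand-in) keeps the example short:

```python
import torch

def sagpool(A, X, score_weights, ratio=0.5):
    """A: (n, n) adjacency; X: (n, d) features; score_weights: (d,) scorer."""
    scores = torch.tanh(X @ score_weights)            # (n,) importance scores
    k = max(1, int(ratio * X.shape[0]))
    idx = torch.topk(scores, k).indices               # hard top-k selection
    X_pooled = X[idx] * scores[idx].unsqueeze(-1)     # gate selected features
    A_pooled = A[idx][:, idx]                         # induced subgraph
    return A_pooled, X_pooled, idx
```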
10. Graph Transformers
10.1 Motivation: GNNs vs Transformers
The fundamental constraint of MPNNs is locality: an $L$-layer GNN can only aggregate information within an $L$-hop neighborhood. For long-range dependencies (e.g., two atoms on opposite ends of a large molecule that interact through space), a deep GNN would need $L \approx \mathrm{diam}(G)$ layers - running into over-smoothing and over-squashing.
The transformer architecture - with full pairwise attention - solves the locality problem: every node can directly attend to every other node in a single layer. But transformers treat inputs as sets of independent tokens; they have no built-in notion of graph structure.
Graph Transformers combine: (1) the global receptive field of transformers (full pairwise attention), with (2) graph-structural inductive biases (local message passing, positional encodings derived from the graph topology).
The trade-off: full attention costs $O(n^2)$ per layer - practical only for small-to-medium graphs. For large graphs, neighbor-sampled MPNNs remain the only scalable option.
10.2 Positional Encodings for Graphs
In transformers, positional encodings break the permutation symmetry of the attention mechanism: without them, the transformer cannot distinguish token position, and all orderings of the same tokens produce the same output. The same problem occurs in graph transformers: the attention mechanism is permutation invariant, so we need positional encodings that break symmetry in a structure-aware way.
Recall from 11-04 9.5: Laplacian eigenvectors provide a natural graph Fourier basis, and the first $k$ eigenvectors of $L_{\text{sym}}$ give a $k$-dimensional coordinate for each node that reflects the graph's spectral structure. The sign invariance problem (eigenvectors are defined only up to sign) is addressed by random sign flipping during training or by using absolute values. Full treatment: 11-04 9.5.
Building on this, three main PE strategies for graph transformers:
Laplacian PE (LapPE). For each node $v$, extract the $v$-th row of the matrix $U_k \in \mathbb{R}^{n \times k}$ whose columns are the first $k$ non-trivial eigenvectors of $L$ (excluding the constant eigenvector). Append to node features:
$$x_v' = \big[x_v \,\|\, (U_k)_{v,:}\big]$$
Sign ambiguity fix: during training, randomly flip the sign of each eigenvector independently (this has no effect on the Laplacian spectrum but makes the model invariant to the sign choice). Used in Dwivedi et al. (2020), GPS (2022).
Random Walk SE (RWSE). For each node $v$ and walk length $\ell$, compute $p_\ell(v) = \big[(D^{-1}A)^\ell\big]_{vv}$ - the probability that a length-$\ell$ random walk starting at $v$ returns to $v$. Stack for $\ell = 1, \ldots, k$:
$$\text{RWSE}(v) = \big[p_1(v), p_2(v), \ldots, p_k(v)\big]$$
RWSE is always non-negative and sign-free (no sign ambiguity). It encodes local loop structure: $p_1(v) = 0$ iff $v$ has no self-loop; $p_3(v) > 0$ iff two neighbors of $v$ are themselves connected (triangles). Used in GPS (2022), GIN+RWSE.
Degree and centrality encoding. Simply append the node degree $d_v$, normalized closeness centrality, or betweenness centrality as scalar node features. Used in Graphormer (2021), which also encodes shortest-path distances as edge features. A sketch of LapPE and RWSE computation follows.
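A dense-matrix sketch of both encodings (exact computation via eigendecomposition and matrix powers - fine for small graphs, replaced by sparse or sampled variants at scale):

```python
import numpy as np

def lap_pe(A, k):
    """Rows of the first k non-trivial eigenvectors of L_sym = I - D^{-1/2} A D^{-1/2}."""
    d = A.sum(axis=1)
    Dinv = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(len(d)) - Dinv @ A @ Dinv
    _, vecs = np.linalg.eigh(L)           # eigenvectors sorted by eigenvalue
    return vecs[:, 1:k + 1]               # skip the constant eigenvector

def rwse(A, k):
    """Return probabilities p_l(v) = [(D^{-1} A)^l]_{vv} for l = 1..k."""
    P = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)
    feats, Pl = [], np.eye(A.shape[0])
    for _ in range(k):
        Pl = Pl @ P
        feats.append(np.diag(Pl).copy())
    return np.stack(feats, axis=1)        # (n, k)
```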
10.3 GPS Framework
General, Powerful, Scalable (GPS) Graph Transformer (Rampasek et al., 2022) is the most general graph transformer framework, winning multiple tracks of the 2022 Long Range Graph Benchmark (LRGB).
GPS Layer (residual connections and normalization omitted for clarity):
$$\hat{X}_M = \text{MPNN}^{(l)}\big(X^{(l)}, A\big), \qquad \hat{X}_T = \text{GlobalAttn}^{(l)}\big(X^{(l)}\big), \qquad X^{(l+1)} = \text{FFN}\big(\hat{X}_M + \hat{X}_T\big)$$
Each GPS layer combines:
- Local MPNN: any standard GNN (GCN, GINE, GAT) operating on the sparse graph edges - captures local structure efficiently
- Global Transformer: full multi-head self-attention over all nodes - captures long-range dependencies
- LayerNorm + residual: stabilizes training
Modularity. GPS is a framework, not a fixed architecture: the MPNN and Transformer components are interchangeable. In practice, GINE (GIN with edge features) + Transformer works well; one can also use GAT + Performer (linear attention approximation) for scalability.
Positional encodings. GPS accepts arbitrary node-level PEs (LapPE, RWSE, degree) as extra input features, processed by a dedicated PE encoder before the first layer:
$$x_v^{(0)} = \big[x_v \,\|\, \phi(\text{pe}_v)\big]$$
where $\phi$ is a small MLP or linear projection.
Performance. On the LRGB Peptides-func benchmark (a task where predictions require integrating global molecular structure), GPS achieves significantly higher average precision than any pure MPNN, demonstrating the necessity of global attention for long-range tasks.
10.4 Graphormer
Graphormer (Ying et al., 2021) uses a standard transformer architecture (with full pairwise attention) augmented by three graph-structural encodings. It won the OGB-LSC 2021 quantum chemistry track.
Centrality encoding. Add a degree-dependent bias to the node features:
$$h_v^{(0)} = x_v + z^{-}_{\deg^{-}(v)} + z^{+}_{\deg^{+}(v)}$$
where $z^{-}$ and $z^{+}$ are learned embeddings indexed by in- and out-degree. This allows the transformer to distinguish hub nodes from leaf nodes.
Spatial encoding. Add a bias to the attention score between nodes $u$ and $v$ based on their shortest-path distance $\phi(u, v)$:
$$A_{uv} = \frac{(h_u W_Q)(h_v W_K)^\top}{\sqrt{d}} + b_{\phi(u,v)}$$
where $b_{\phi(u,v)}$ is a scalar learned per distance value. Nodes far apart get one learned attention bias; nearby nodes get a different one. This encodes graph topology directly into the attention pattern.
Edge encoding. For each pair $(u, v)$, average the edge features along a shortest path $(e_1, \ldots, e_N)$:
$$c_{uv} = \frac{1}{N} \sum_{i=1}^{N} e_i^\top w_i$$
where $e_i$ is the feature of the $i$-th path edge and $w_i$ is a learned vector. Added to the attention score as an additional bias.
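A single-head sketch of attention with the spatial bias (edge encoding omitted; `b_table` is a learned vector holding one scalar bias per shortest-path distance, and `spd` is assumed to be precomputed, e.g., by BFS):

```python
import torch

def biased_attention(H, spd, b_table, Wq, Wk, Wv):
    """H: (n, d_in); spd: (n, n) long tensor of shortest-path distances;
    b_table: (max_dist + 1,) learned scalar bias per distance."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    d = Q.shape[-1]
    logits = Q @ K.t() / d ** 0.5 + b_table[spd]   # add learned distance bias
    return torch.softmax(logits, dim=-1) @ V
```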
10.5 Graph Mamba and Sequence-Based Methods
The quadratic cost of full attention makes graph transformers impractical for large graphs. Recent work (2023-2024) explores linear-complexity alternatives.
Converting graphs to sequences. Several methods serialize the graph into a sequence of tokens, then apply a sequence model (LSTM, Mamba SSM):
- BFS/DFS ordering: nodes in BFS/DFS traversal order; nearby nodes in the graph -> nearby in sequence. Breaks permutation invariance (different traversal orderings give different results).
- Node-and-edge tokenization (TokenGT, Kim et al., 2022): represent each node and each edge as a separate token; full transformer attention over all tokens. Permutation invariant if node/edge PEs are orthogonalized.
Graph Mamba (Chen et al., 2023). Apply Mamba (a state-space model with selective scan) to graphs by: (1) ordering nodes by degree or BFS, (2) running the SSM along this ordering as a "linearized graph sequence." Achieves $O(n)$ complexity while matching transformer performance on some benchmarks. The key challenge: the SSM must be made robust to different node orderings (since graphs have no canonical ordering).
Status (2026). Graph Mamba and similar linear-attention methods are active research areas. For small-to-medium graphs, GPS-style full attention is standard. For large graphs, MPNN-based methods (GraphSAGE, Cluster-GCN) remain dominant in production.
11. Training and Scaling GNNs
11.1 Full-Batch vs Mini-Batch Training
Full-batch training. Compute the GNN on the entire graph in each step. Requires the entire graph and all node features to fit in GPU memory - practical only for small and medium graphs. Exact gradient computation.
Mini-batch training. Sample a set of target nodes (a "mini-batch") and compute the GNN only on those nodes and their $L$-hop neighborhoods. Much more memory-efficient, but requires careful sampling to produce unbiased gradient estimates.
The challenge of mini-batch training for GNNs is the neighborhood explosion: if each node has degree $d$, an $L$-hop neighborhood has $O(d^L)$ nodes. For $d = 10$, $L = 3$, this is 1000 nodes per target node - often larger than the desired mini-batch size.
Three approaches manage this explosion: neighbor sampling (11.2), graph partitioning (11.3), and subgraph sampling (11.4).
11.2 Neighbor Sampling
GraphSAGE neighbor sampling (Hamilton et al., 2017). For each target node at each layer, sample a fixed-size subset of neighbors:
- Layer 1 (nearest neighbors): sample $S_1$ neighbors
- Layer 2 (2-hop): sample $S_2$ neighbors of each sampled layer-1 neighbor
Total nodes per target: $1 + S_1 + S_1 S_2$ for two layers, i.e. $O(S_1 S_2)$. Constant cost per target node, regardless of graph degree.
Variance reduction. Random sampling introduces variance in the gradient estimate. This can be reduced by importance sampling (weight each sampled neighbor by the inverse of its sampling probability) or by using multiple samples. In practice, sample sizes of roughly 10-25 per layer are sufficient for most tasks.
VR-GCN (Chen et al., 2018). A control-variates method: maintain historical node embeddings $\bar{h}_u$ as running averages, and use them as a control variate for unsampled neighbors:
$$m_v \approx \frac{|N(v)|}{|\hat{N}(v)|} \sum_{u \in \hat{N}(v)} \big(h_u - \bar{h}_u\big) \;+\; \sum_{u \in N(v)} \bar{h}_u$$
where $\hat{N}(v)$ is the sampled neighbor set. This gives an unbiased estimate of the full-batch aggregation with lower variance than simple neighbor sampling.
11.3 Cluster-GCN
Cluster-GCN (Chiang et al., 2019) takes a graph partitioning approach: partition the graph into balanced clusters using METIS or other graph partitioning algorithms. A mini-batch consists of one or more clusters.
Key insight. Within a cluster, most edges are intra-cluster (partitioning algorithms minimize the cut). Running GCN on the induced subgraph of a cluster produces embeddings that see most relevant edges. Between-cluster edges (the cut) are ignored - this introduces approximation error, but the cut is small by design.
Memory: a cluster of size $n_c$ requires $O(n_c)$ node features plus the induced adjacency. For $n_c \ll n$, memory is orders of magnitude smaller than full-batch training.
Variance: the gradient approximation error comes from ignoring cross-cluster edges. To reduce this, Cluster-GCN uses multiple cluster sampling: each mini-batch consists of $q$ randomly selected clusters, retaining all edges within their union. This recovers some cross-cluster edges.
11.4 GraphSAINT: Subgraph Sampling
GraphSAINT (Zeng et al., 2020) frames mini-batch training as subgraph sampling: each mini-batch is the subgraph induced by a sampled node set $V_s \subseteq V$.
Three sampling strategies:
- Node sampler: sample nodes uniformly; include all edges between sampled nodes. Simple but misses important edges.
- Edge sampler: sample edges uniformly; include all nodes incident to sampled edges. Better edge coverage.
- Random walk sampler: start random walks from random nodes; include all visited nodes (and their induced edges). Produces connected subgraphs with richer local structure.
Normalization. Since different nodes and edges are sampled with different probabilities, the gradient is biased without correction. GraphSAINT computes the inclusion probability $p_v$ of each node and $p_{uv}$ of each edge analytically (or by estimation), then reweights: each node's loss term by $1/p_v$ and each edge's message by $1/p_{uv}$ (up to a normalization constant). This yields an unbiased gradient estimate and removes the systematic bias of subgraph sampling.
Scaling. GraphSAINT successfully trains GNNs on Flickr (89K nodes, 900K edges), Reddit (233K nodes, 11.6M edges), and Yelp (717K nodes, 7.0M edges) - graphs orders of magnitude larger than full-batch training can handle.
12. Applications in Machine Learning
12.1 Molecular Property Prediction and Drug Discovery
Molecular property prediction is the "killer application" of GNNs. A molecule is naturally a graph: atoms are nodes (with features: atomic number, hybridization, formal charge, chirality), and bonds are edges (with features: bond type, aromaticity, stereo configuration).
MPNN for quantum chemistry (Gilmer et al., 2017). Applied to the QM9 dataset (134K small organic molecules, 12 quantum chemical properties including dipole moment, polarizability, and HOMO-LUMO gap). The MPNN with edge-network message functions achieves state-of-the-art on 11 of 12 properties, demonstrating that learned graph representations outperform hand-engineered molecular descriptors.
SchNet (Schutt et al., 2017). A rotationally invariant GNN for 3D molecular geometry. Nodes are atoms; edges connect atoms within a cutoff radius, with edge features encoding pairwise distances. Uses continuous-filter convolutional layers: $h_v' = \sum_{u \in N(v)} h_u \odot W\big(\|r_u - r_v\|\big)$, where the filter-generating network $W$ is an MLP applied to the interatomic distance. Achieves chemical accuracy on QM9 for most properties.
AlphaFold 2 (Jumper et al., 2021). The protein structure prediction breakthrough. Key GNN component: the Evoformer, which processes pairs of residues (edge features) and individual residues (node features) through an iterative process. The structure module then uses equivariant graph operations to predict 3D residue coordinates. AlphaFold 2 won the CASP14 protein structure prediction competition with near-experimental accuracy on most targets - a problem open for 50 years.
Drug discovery pipeline. Modern computational drug discovery uses GNNs at every stage:
- Virtual screening: predict binding affinity between drug candidates (small molecules) and protein targets; GNNs on molecular graphs substantially reduce wet-lab screening costs
- ADMET prediction: predict absorption, distribution, metabolism, excretion, and toxicity properties from molecular structure
- De novo design: generative GNNs (junction tree VAE, graph diffusion models) design novel molecules with desired properties
12.2 Knowledge Graph Reasoning
A knowledge graph (KG) is a directed, heterogeneous graph $G = (E, R, T)$ where $E$ is a set of entities, $R$ is a set of relation types, and $T \subseteq E \times R \times E$ is a set of triples (subject, relation, object): e.g., (Einstein, bornIn, Germany), (Germany, locatedIn, Europe).
Triple completion (link prediction). Given an incomplete KG, predict missing triples: is (Einstein, worksWith, Bohr) a true triple? Key methods:
- TransE (Bordes et al., 2013): model relation $r$ as a translation in embedding space: $e_s + r \approx e_o$ for true triples $(s, r, o)$. Simple and effective for 1-to-1 relations.
- RotatE (Sun et al., 2019): model relation $r$ as a rotation in complex space: $e_o = e_s \circ r$ with $|r_i| = 1$, where $\circ$ is element-wise complex multiplication. Captures symmetry, antisymmetry, and composition patterns.
R-GCN (Schlichtkrull et al., 2018). A GNN for relational (heterogeneous) graphs, with a separate weight matrix for each relation type:
$$h_v^{(l+1)} = \sigma\Big( W_0^{(l)} h_v^{(l)} + \sum_{r \in R} \sum_{u \in N_r(v)} \frac{1}{c_{v,r}}\, W_r^{(l)} h_u^{(l)} \Big)$$
where $N_r(v)$ are the neighbors of $v$ through relation $r$ and $c_{v,r}$ is a normalization constant (e.g., $|N_r(v)|$). Used for entity classification and relation prediction in Freebase and AIFB knowledge graphs.
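A dense sketch of one R-GCN layer (one adjacency matrix per relation; real implementations use sparse edge lists and basis-decomposed weights to control parameter count):

```python
import torch

def rgcn_layer(H, adj_per_rel, W_rel, W_self):
    """H: (n, d_in); adj_per_rel: list of (n, n) relation adjacencies;
    W_rel: list of (d_in, d_out) per-relation weights; W_self: (d_in, d_out)."""
    out = H @ W_self                                     # self-connection term
    for A_r, W_r in zip(adj_per_rel, W_rel):
        deg = A_r.sum(dim=1, keepdim=True).clamp(min=1)  # c_{v,r} = |N_r(v)|
        out = out + (A_r @ H @ W_r) / deg                # normalized relational sum
    return torch.relu(out)
```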
For AI: In 2024, knowledge graphs power retrieval-augmented generation systems. Microsoft's Graph RAG (Edge et al., 2024) builds a knowledge graph from documents, then uses GNNs (and graph traversal) to answer complex multi-hop questions that flat vector retrieval cannot handle.
12.3 Recommendation Systems
Recommendation is a bipartite graph problem: users and items form nodes; interactions (clicks, purchases, ratings) form edges. The task is link prediction: predict which unobserved pairs represent true preferences.
LightGCN (He et al., 2020). Simplifies GCN for collaborative filtering by removing the feature transformation and nonlinearity, keeping only the graph smoothing:
$$e_u^{(l+1)} = \sum_{i \in N(u)} \frac{1}{\sqrt{|N(u)|\,|N(i)|}}\; e_i^{(l)}$$
Final embeddings: $e_u = \frac{1}{L+1} \sum_{l=0}^{L} e_u^{(l)}$ (JK-style layer combination). Prediction: $\hat{y}_{ui} = e_u^\top e_i$.
LightGCN outperforms standard GCN for collaborative filtering, suggesting that the feature transformation is unnecessary (or harmful) when the only input feature is an ID embedding. The key contribution is the propagation of user-item signals through multi-hop paths.
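The entire propagation fits in a few lines; a dense sketch (with `A_hat` the symmetrically normalized bipartite adjacency and `E0` the learned ID embeddings):

```python
import torch

def lightgcn(A_hat, E0, num_layers=3):
    """A_hat: (n, n) normalized user-item adjacency; E0: (n, d) ID embeddings."""
    E, layers = E0, [E0]
    for _ in range(num_layers):
        E = A_hat @ E                                # pure smoothing, no weights
        layers.append(E)
    return torch.stack(layers, dim=0).mean(dim=0)    # average across layers

# Prediction for user u and item i: final[u] @ final[i]
```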
PinSage (already discussed in 5.5) extends GraphSAGE to the bipartite user-pin graph with importance-based sampling and hard negative mining.
12.4 Code and Program Analysis
Source code can be represented as multiple graphs simultaneously:
- AST (Abstract Syntax Tree): hierarchical tree showing syntactic structure
- CFG (Control Flow Graph): nodes are basic blocks; edges show execution flow (if/else, loops)
- DFG (Data Flow Graph): edges connect variable definitions to their uses
- Call Graph: nodes are functions; edges connect callers to callees
code2vec (Alon et al., 2019). Represents code snippets as bags of AST paths; learns path embeddings and aggregates them for downstream tasks. While not a full GNN, it demonstrates the power of structural code representations.
GNNs for bug detection. Allamanis et al. (2018) build heterogeneous graphs from Python/C code with AST, data flow, and control flow edges; train a GNN to classify whether a variable name is misused. Achieves high precision on detecting certain classes of bugs.
Program synthesis. GNNs operating on execution traces (program state as graph) guide neural program search. DeepCoder (Balog et al., 2017) and more recent systems use graph representations of input-output examples to synthesize programs.
12.5 LLM Integration: Graph RAG and Structure-Aware Language Models
The 2024-2026 frontier is combining GNNs with large language models - exploiting the complementary strengths of relational structure (GNNs) and natural language understanding (LLMs).
Graph RAG (Edge et al., 2024, Microsoft Research). Standard RAG (retrieval-augmented generation) retrieves flat text chunks. Graph RAG builds a knowledge graph from documents and uses graph community detection (Leiden algorithm) to generate hierarchical summaries. At query time, relevant communities are retrieved and their summaries are used to ground the LLM response.
Key insight: complex questions requiring synthesis across many documents (e.g., "What are the main themes in this corpus?") are better answered by traversing a knowledge graph than by retrieving isolated text chunks. Graph RAG reports substantially higher win rates than naive RAG on these "global sensemaking" queries.
GNN+LLM for molecular generation. LLMs can generate SMILES strings (text representations of molecules), but they struggle with structural validity constraints (valence, aromaticity). GNN-based validity checkers or graph-structured decoders can enforce these constraints.
Graph-structured memory for agents. Recent AI agents (2025) maintain working memory as knowledge graphs updated with each observation. A GNN processes the graph state to inform action selection. This enables multi-hop reasoning ("I know A->B->C, so if I see C, I should infer A") that flat token sequences handle poorly.
13. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Using mean aggregation for graph classification when graph size varies | Mean normalizes by node count - two graphs with identical local structure but different sizes get identical embeddings, despite being different | Use sum aggregation (which preserves size information) or pair with a size feature |
| 2 | Forgetting self-loops in GCN | Without self-loops (using $A$ instead of $\tilde{A} = A + I$), a node does not include its own features in aggregation - it only receives neighbors' information, losing its own signal | Always add $I$ to $A$ before normalization; this is the renormalization trick from Kipf & Welling |
| 3 | Adding too many GCN layers and blaming model capacity | Deep GCNs fail due to over-smoothing, not lack of capacity - adding parameters won't help | Use 2-4 layers; add residual connections (GCNII), DropEdge, or PairNorm if more depth is needed |
| 4 | Treating GAT attention weights as feature importance | Attention weights tell you which neighbors were weighted more, not which features were important - this is the "attention is not explanation" problem | Use gradient-based attribution (GradCAM, Integrated Gradients) for feature importance; treat attention as architectural choice, not explanation |
| 5 | Using GIN with mean aggregation | GIN's theoretical power (matching 1-WL) requires sum aggregation with an injective MLP. Mean aggregation forfeits this guarantee - it cannot distinguish multisets that differ only in multiplicities | Always use sum aggregation in GIN; verify that the MLP has sufficient depth ($\geq 2$ layers) |
| 6 | Forgetting that 1-WL cannot detect triangles | Any MPNN is bounded by 1-WL; 1-WL cannot count triangles or cliques. If your task requires triangle detection (e.g., social network analysis), a standard GNN will fail | Add structural features (triangle count, RWSE) or use a higher-order GNN (NGNN, subgraph GNN) |
| 7 | Normalizing node features but not edge features | Unnormalized edge features with large variance can dominate the attention scores in GAT or the message in MPNN | Normalize edge features to zero mean, unit variance; use LayerNorm before message computation |
| 8 | Using transductive GCN for inductive tasks | GCN's normalized adjacency is computed on the training graph; new nodes at test time require recomputing the entire adjacency and rerunning the network | Use inductive methods (GraphSAGE, GAT) that learn aggregation functions, not fixed propagation matrices |
| 9 | Applying global pooling before sufficient local aggregation | With only 1 GNN layer before readout, each node only knows its immediate neighbors; graph-level representations lack structural context | Use 3-5 GNN layers before readout; consider hierarchical pooling (DiffPool) for hierarchically structured graphs |
| 10 | Ignoring the over-squashing problem for long-range tasks | If the task requires integrating information from nodes far apart, a shallow MPNN will fail due to over-squashing even without over-smoothing | Use graph rewiring (DIGL, EGP) or graph transformers (GPS, Graphormer) for long-range tasks; measure effective receptive field |
| 11 | Training GNN on the test graph for transductive semi-supervised learning | Using the test graph topology during training is allowed (transductive setting), but using test node labels is a data leakage error | Only mask the labels of test nodes; the full graph adjacency is legitimately used during both training and testing in the transductive setting |
| 12 | Confusing graph-level and node-level tasks in the readout | Using node embeddings directly for graph classification (without pooling) produces an embedding for each node, not for the graph - dimensions won't match | Always apply a readout function (sum/mean/attention pooling) after the final GNN layer for graph-level tasks |
14. Exercises
Exercise 1 * - MPNN Implementation
Implement a 2-layer MPNN on a small graph and compute node representations from scratch.
(a) Construct the adjacency matrix $A$ for a small example graph (e.g., 5 nodes with edges forming a cycle plus one chord).
(b) Initialize node features $X = I$ (identity matrix - each node has a one-hot representation).
(c) Implement one GCN layer: $H^{(1)} = \mathrm{ReLU}\big(\hat{A} X W^{(0)}\big)$ where $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, $\tilde{A} = A + I$, and $W^{(0)}$ is a random matrix (fixed seed).
(d) Apply a second GCN layer to get $H^{(2)}$.
(e) Verify that permuting the nodes (applying a permutation matrix $P$ to the rows of $X$ and to both rows and columns of $A$) produces permuted outputs $P H^{(2)}$.
Exercise 2 * - GCN Propagation Matrix
Analyze the spectral properties of the GCN propagation matrix $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$.
(a) For the path graph $P_5$ (5 nodes in a line), compute $\hat{A}$ and find its eigenvalues.
(b) Show that all eigenvalues of $\hat{A}$ lie in $(-1, 1]$ by relating $\hat{A}$ to the normalized Laplacian $\tilde{L} = I - \hat{A}$ and using the PSD property of $\tilde{L}$.
(c) Compute $\hat{A}^k$ for $k = 1, \ldots, 20$ and visualize the row norms as a function of $k$. Observe over-smoothing: the rows converge.
(d) Show that the limiting matrix $\lim_{k \to \infty} \hat{A}^k$ has all rows proportional to the dominant eigenvector $\pi$ with entries $\pi_v \propto \sqrt{d_v + 1}$ - the stationary distribution. Compute $\pi$ for $P_5$.
Exercise 3 * - Aggregation Expressiveness
Demonstrate that sum aggregation is strictly more expressive than mean for multisets.
(a) Construct two multisets with equal means but different sums, e.g., $S_1 = \{1, 1\}$ and $S_2 = \{1, 1, 1\}$ (scalar features, value 1). Compute sum($S_1$), mean($S_1$), sum($S_2$), mean($S_2$). Verify that sum distinguishes them but mean does not.
(b) Construct two multisets such as $S_1 = \{1, 2\}$ and $S_2 = \{2, 2\}$. Show that max($S_1$) = max($S_2$) = 2, but sum($S_1$) $\neq$ sum($S_2$).
(c) For two graphs $G_1$ (star with 3 leaves) and $G_2$ (path of 4 nodes), show that a GCN with mean aggregation assigns identical representations to some nodes, but a GIN with sum aggregation does not.
(d) Implement the WL color refinement algorithm for $G_1$ and $G_2$ with uniform initial colors. How many iterations until stable? Does WL distinguish $G_1$ from $G_2$?
Exercise 4 ** - Weisfeiler-Leman Test
Implement the 1-WL algorithm and apply it to pairs of graphs.
(a) Implement wl_isomorphism_test(G1, G2, max_iter=10): run 1-WL color refinement on both graphs simultaneously; return "MAYBE_ISOMORPHIC" if the final color histograms match, or "NOT_ISOMORPHIC" at the first iteration where they differ.
(b) Test on $C_6$ (the 6-cycle) and $K_{3,3}$ (complete bipartite, 6 nodes, 9 edges). What does WL conclude? Are they actually isomorphic?
(c) Construct two non-isomorphic 3-regular graphs on 6 nodes ($K_{3,3}$ and the prism graph $C_3 \times K_2$). Does 1-WL distinguish them? (Hint: both are 3-regular, so the degree-based first iteration is identical.)
(d) Explain why any MPNN with mean aggregation would fail to distinguish these graphs, while GIN with sum aggregation would succeed (or also fail, if WL itself fails).
Exercise 5 ** - GAT Attention Mechanism
Implement a single GAT attention head from scratch.
(a) For a graph with 5 nodes and a small edge set of your choice, compute attention coefficients $\alpha_{vu}$ for all edges.
(b) Use, e.g., input dimension $d = 4$ and output dimension $d' = 2$, with random $W$ and $a$ (fixed seed). Initialize node features randomly.
(c) Compute the attention logits $e_{vu} = \mathrm{LeakyReLU}\big(a^\top [W h_v \,\|\, W h_u]\big)$ (slope 0.2) for all edges, then apply softmax over each node's neighborhood.
(d) Show the difference between GAT and GATv2: compute GATv2 attention scores $e_{vu} = a^\top \mathrm{LeakyReLU}\big(W [h_v \,\|\, h_u]\big)$ and verify that the neighbor ranking for node 0 can differ between GAT and GATv2.
(e) Compute the updated node representations $h_v' = \sigma\big(\sum_{u \in N(v)} \alpha_{vu} W h_u\big)$.
Exercise 6 ** - Over-Smoothing Dynamics
Quantify over-smoothing as a function of GCN depth.
(a) Construct a Cora-like graph: generate a stochastic block model with 200 nodes, 4 blocks, within-block edge probability $p$ much larger than the between-block probability $q$ (e.g., $p = 0.1$, $q = 0.01$).
(b) Initialize node features as class one-hot vectors (the class assignments as a $200 \times 4$ matrix).
(c) Apply the GCN propagation $H \mapsto \hat{A} H$ (no weight matrix, no nonlinearity - pure smoothing) for $k = 1, \ldots, 50$ steps. Compute the Dirichlet energy $E\big(H^{(k)}\big)$ at each depth.
(d) Plot $E\big(H^{(k)}\big)$ vs $k$ on a log scale. Fit an exponential and estimate the decay rate.
(e) Compare with the theoretical rate $(1 - \lambda_2)^{2k}$, where $\lambda_2$ is the Fiedler value of the random-walk Laplacian. Compute $\lambda_2$ numerically and compare.
Exercise 7 *** - GIN Implementation and Expressiveness
Implement GIN and compare its expressiveness to GCN on graph-level classification.
(a) Implement a 3-layer GIN with sum aggregation and a 2-layer MLP update: $h_v^{(l+1)} = \text{MLP}^{(l)}\big((1 + \epsilon)\, h_v^{(l)} + \sum_{u \in N(v)} h_u^{(l)}\big)$ with $\epsilon = 0$.
(b) Construct two non-isomorphic graphs $G_1$ and $G_2$ that 1-WL cannot distinguish without initial node features (as in 7.2). Initialize all nodes with the same feature vector $x_v = \mathbf{1}$.
(c) Show that with uniform initial features, neither GIN nor GCN can distinguish $G_1$ and $G_2$ (both produce identical graph-level sum representations). Why?
(d) Now add the degree as an initial node feature: $x_v = d_v$. Rerun both GIN and GCN. Does GIN now distinguish the two graphs? Does GCN?
(e) Generate a dataset of 500 random graphs (a mix of Erdos-Renyi and stochastic block models); train GIN and GCN for graph-level binary classification. Report test accuracy. Verify that GIN accuracy $\geq$ GCN accuracy.
Exercise 8 *** - Graph Transformer with Positional Encoding
Implement a simplified graph transformer layer with Laplacian positional encoding.
(a) For a graph with $n = 20$ nodes (Barabasi-Albert model, e.g., with attachment parameter $m = 2$), compute the normalized Laplacian and extract the first $k$ non-trivial eigenvectors $U_k$ (e.g., $k = 4$).
(b) Initialize node features randomly and augment with LapPE: $x_v' = [x_v \,\|\, (U_k)_{v,:}]$.
(c) Implement one layer of scaled dot-product self-attention over all 20 nodes (fully connected, ignoring graph edges): $\text{Attn}(H) = \mathrm{softmax}\big(QK^\top / \sqrt{d}\big)\, V$.
(d) Implement one GCN layer on the same graph. Visualize the transformer's attention matrix alongside the GCN propagation matrix $\hat{A}$: which nodes attend to which? Does the graph transformer's attention recover some of the graph structure (do nearby nodes attend to each other more)?
(e) Compare: run 3 iterations of graph transformer attention vs 3 layers of GCN on the same random features. Compute the Dirichlet energy after each: which decreases faster? What does this reveal about over-smoothing in graph transformers vs GNNs?
15. Why This Matters for AI (2026 Perspective)
| Concept | AI Impact |
|---|---|
| MPNN framework | The unifying abstraction behind AlphaFold 2 (protein structure), drug screening, and materials design GNNs. Most new spatial GNN architectures published in 2024-2026 are special cases |
| GCN propagation rule | Directly computable from sparse adjacency; used in LightGCN powering Netflix/Pinterest/TikTok recommendation at billion-node scale |
| GraphSAGE inductive learning | Enables daily-updated embeddings for dynamic graphs (social networks, e-commerce) without retraining the full model |
| GAT / GATv2 | Learned, sparse attention patterns over structured data; used in molecular GNNs for drug-target interaction prediction (AstraZeneca, Recursion Pharmaceuticals) |
| WL expressiveness theorem | Theoretical bound establishing what structure any MPNN can detect; determines when to add structural features or upgrade to subgraph GNNs; guides architecture search |
| GIN sum aggregation | The foundation for most molecular property prediction models; identifies that all mainstream GCN deployments are sub-optimal in expressiveness |
| Over-smoothing analysis | Explains why 2-layer GNNs outperform 8-layer GNNs in most production deployments; directly informs depth tuning decisions |
| Over-squashing and graph rewiring | Motivates graph preprocessing in biochemistry (adding virtual bonds between distant atoms) and knowledge graph reasoning (adding transitive edges) |
| Graph Transformers (GPS, Graphormer) | State-of-the-art on quantum chemistry benchmarks (QM9, OGB-LSC); powering the next generation of molecular AI models |
| LapPE and RWSE | Standard input features for all graph foundation models; analogous to positional encodings in LLMs - without them, graph models cannot distinguish relative node positions |
| Graph RAG | Microsoft's production system (Azure AI Search) for document understanding queries requiring multi-hop reasoning; deployed across Office 365 and Microsoft 365 Copilot |
| Hierarchical pooling (DiffPool, MinCutPool) | Used in drug candidate scoring for proteins and polymer graphs where hierarchical structure (residues -> domains -> whole protein) is essential |
| Neighbor sampling (GraphSAGE, Cluster-GCN) | The key to training GNNs on billion-node graphs; enables GPU training without the full graph fitting in memory |
16. Conceptual Bridge
Looking Backward: What This Section Builds On
Graph Neural Networks are the synthesis of all earlier mathematics in this curriculum. From Chapter 2 Linear Algebra, we use matrix multiplication (the propagation rule $\hat{A} H W$), eigenvectors (the spectral derivation of GCN), and norms (Dirichlet energy). From Chapter 3 Advanced Linear Algebra, we use eigendecomposition (Laplacian positional encodings), PSD theory (the Laplacian is PSD, proven in 11-04), and the spectral theorem (graph Fourier transform). From Chapter 4 Calculus, we use gradients and the chain rule for backpropagation through GNN layers. From Chapter 8 Optimization, we use SGD variants and the mini-batch training strategies of 11. From Chapter 11 Graph Theory itself, we have built on 11-04's spectral theory (the GCN derivation via Chebyshev polynomials, over-smoothing as diffusion, Laplacian eigenmaps as positional encodings) and 11-03's algorithms (BFS neighborhoods as the receptive field, min-cut/max-flow in analogy with Cheeger's inequality).
Looking Forward: What This Section Enables
Chapter 12 Functional Analysis generalizes the spectral ideas of GNNs to infinite-dimensional Hilbert spaces. The graph Fourier transform is the finite-dimensional analogue of the Fourier transform on $L^2(\mathbb{R})$; the Laplacian eigenvectors are the finite-dimensional analogue of Fourier modes. Functional analysis provides the rigorous framework for this extension, enabling kernel methods on graphs and the theoretical analysis of continuous-limit GNNs (graphons).
Chapter 14 Math for Specific Models will revisit GNN architectures in the context of equivariant neural networks - networks that respect symmetries (rotation, reflection, permutation) by design. Geometric GNNs for 3D molecular data (DimeNet, SE(3)-Transformers, NequIP) extend the MPNN framework with $SE(3)$-equivariant message functions, enabling prediction of orientation-dependent molecular properties.
Chapter 21 Statistical Learning Theory will study GNN generalization bounds: how many training graphs are needed for a GNN to generalize? The WL expressiveness hierarchy (7) directly informs these bounds - more expressive GNNs require more data to generalize.
Chapter 22 Causal Inference uses directed acyclic graphs (DAGs) - a special class of directed graphs studied in 11-01. The graph algorithms of 11-03 (topological sort for causal ordering) and the GNN architectures of 11-05 (for learning on causal graphs) connect these chapters.
```
GRAPH NEURAL NETWORKS - POSITION IN THE CURRICULUM
========================================================================
Ch.02-03: Linear Algebra, Eigenvalues, SVD
        |
        v
Ch.11-04: Spectral Graph Theory
        |   [Laplacian, GCN spectral derivation,
        |    over-smoothing preview, LapPE]
        v
+==========================================+
| Ch.11-05: Graph Neural Networks          |  <- YOU ARE HERE
|                                          |
| MPNN -> GCN -> GraphSAGE -> GAT -> GIN   |
| WL Expressiveness -> Over-smoothing      |
| Graph Transformers -> GPS/Graphormer     |
| Applications: AlphaFold 2, Graph RAG     |
+==========================================+
        |                       |
        v                       v
Ch.11-06:               Ch.12: Functional Analysis
Random Graphs           [Graph limits (graphons),
[Graph models            kernel methods on graphs,
 for GNN                 infinite-dimensional
 benchmarks]             spectral theory]
        |
        v
Ch.14: Math for Specific Models
[Equivariant GNNs, SE(3)-Transformers,
 3D molecular AI, geometric deep learning]
        |
        v
Ch.21: Statistical Learning Theory
[GNN generalization bounds,
 WL-based complexity, PAC learning on graphs]
========================================================================
```
<- Back to Graph Theory | Previous: Spectral Graph Theory <- | Next: Random Graphs ->
Appendix A: Mathematical Derivations
A.1 Proof that the GCN Propagation Rule is Permutation Equivariant
Claim. The GCN layer $f(X, A) = \sigma\big(\hat{A} X W\big)$ is permutation equivariant: for any $n \times n$ permutation matrix $P$,
$$f\big(PX,\; PAP^\top\big) = P\, f(X, A).$$
Proof. Let $A' = PAP^\top$. Then $\tilde{A}' = A' + I = P(A + I)P^\top = P \tilde{A} P^\top$. Since permuting the rows and columns of a matrix permutes its row sums, the degree matrix of $\tilde{A}'$ is:
$$\tilde{D}' = \mathrm{diag}\big(\tilde{A}' \mathbf{1}\big) = P \tilde{D} P^\top, \qquad \tilde{D}'^{-1/2} = P \tilde{D}^{-1/2} P^\top.$$
So, using $P^\top P = I$:
$$\hat{A}' = \tilde{D}'^{-1/2} \tilde{A}' \tilde{D}'^{-1/2} = P \tilde{D}^{-1/2} P^\top P \tilde{A} P^\top P \tilde{D}^{-1/2} P^\top = P \hat{A} P^\top.$$
Therefore:
$$f\big(PX, PAP^\top\big) = \sigma\big(\hat{A}' P X W\big) = \sigma\big(P \hat{A} P^\top P X W\big) = \sigma\big(P \hat{A} X W\big) = P\, \sigma\big(\hat{A} X W\big) = P\, f(X, A).$$
The last step uses the fact that $\sigma$ is applied element-wise, so it commutes with row permutation: $\sigma(PM) = P\sigma(M)$ for any matrix $M$. $\blacksquare$
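The claim is also easy to check numerically; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                       # random undirected graph
X = rng.standard_normal((n, d))
W = rng.standard_normal((d, d))

def gcn(A, X, W):
    At = A + np.eye(len(A))                          # add self-loops
    Dinv = np.diag(1.0 / np.sqrt(At.sum(axis=1)))
    return np.maximum(Dinv @ At @ Dinv @ X @ W, 0)   # ReLU(A_hat X W)

P = np.eye(n)[rng.permutation(n)]                    # random permutation matrix
assert np.allclose(gcn(P @ A @ P.T, P @ X, W), P @ gcn(A, X, W))
```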
A.2 Universal Approximation of Injective Multiset Functions
The theoretical foundation of GIN rests on the following characterization:
Theorem (Xu et al. 2019, following Zaheer et al. 2017). Let $\mathcal{X}$ be a countable set. A function $f$ on the space of finite multisets over $\mathcal{X}$ can be made injective by choosing $\phi : \mathcal{X} \to \mathbb{R}$ and writing:
$$f(S) = \rho\Big(\sum_{x \in S} \phi(x)\Big)$$
for a suitable outer function $\rho$.
Proof sketch (sufficiency). Since $\mathcal{X}$ is countable, enumerate it as $x_1, x_2, x_3, \ldots$. A multiset $S$ over $\mathcal{X}$ is characterized by its multiplicity function $m_S(x_i)$ - the number of times $x_i$ appears. Assume multiplicities are bounded by $N - 1$ and choose $\phi(x_i) = N^{-i}$ (a unique "digit position" per element). Then:
$$\sum_{x \in S} \phi(x) = \sum_i m_S(x_i)\, N^{-i}$$
is an injective mapping from the multiplicity function to $\mathbb{R}$ (by the uniqueness of base-$N$ representations with bounded digits). The function $\rho$ can then be any function that recovers $m_S$ from this sum.
Why sum specifically. Mean and max aggregation discard multiplicity information (7.3), so no choice of $\phi$ and $\rho$ makes them injective on multisets; the separating power over countable sets is specific to sums.
Consequence for GIN. With a sufficiently expressive $\phi$ (a deep MLP, by universal approximation) and $\rho$ (another deep MLP), GIN can represent any injective function on multisets of node features. This gives GIN the maximum discriminative power achievable by any MPNN.
A.3 Dirichlet Energy Decay Rate
Theorem. Let $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ be the GCN propagation matrix with $\tilde{A} = A + I$. For any node feature matrix $X$:
$$E\big(\hat{A} X\big) \le (1 - \bar\lambda)^2\, E(X),$$
where $\bar\lambda = \min\{\lambda_2,\; 2 - \lambda_n\}$ and $0 = \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$ are the eigenvalues of the normalized Laplacian $\tilde{L} = I - \hat{A}$ of $\tilde{A}$.
Proof. Write the eigendecomposition $\tilde{L} = U \Lambda U^\top$, where $U$ is orthonormal and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$, so that $\hat{A} = I - \tilde{L} = U (I - \Lambda) U^\top$. Since $E(X) = \mathrm{tr}\big(X^\top \tilde{L} X\big)$:
$$E\big(\hat{A} X\big) = \mathrm{tr}\big(X^\top \hat{A}\, \tilde{L}\, \hat{A} X\big) = \sum_i \lambda_i (1 - \lambda_i)^2 \big\| u_i^\top X \big\|_2^2 \le \max_{i \ge 2} (1 - \lambda_i)^2 \cdot E(X).$$
The eigenvalues of $\hat{A}$ are $\mu_i = 1 - \lambda_i \in (-1, 1]$: the self-loops in $\tilde{A}$ rule out bipartiteness, so $\lambda_n < 2$ strictly, and therefore $\max_{i \ge 2} |1 - \lambda_i| = 1 - \bar\lambda < 1$.
Iterating the bound, $E\big(\hat{A}^k X\big) \le (1 - \bar\lambda)^{2k} E(X) \to 0$ exponentially fast.
A.4 Jacobian Bound for Over-Squashing
Theorem (Alon & Yahav, 2021). For a GNN with $L$ layers, weight matrices bounded by $\|W^{(l)}\| \le \beta$, and Lipschitz activation $\sigma$:
$$\left\| \frac{\partial h_v^{(L)}}{\partial x_u} \right\| \le C\, \beta^L \big(\hat{A}^L\big)_{vu},$$
where $C$ is a constant depending on the architecture.
Key observation. The entry $\big(\hat{A}^L\big)_{vu}$ counts the (degree-weighted) number of walks of length $L$ from $u$ to $v$. For nodes $u$ and $v$ separated by a bottleneck edge (an edge with large removal betweenness centrality), all walks from $u$ to $v$ must pass through that edge. The bottleneck effect:
If the bottleneck node $w$ has degree $d_w$, then at most a $1/d_w$ fraction of the normalized walk weight reaching $w$ proceeds to any particular neighbor. Thus $\big(\hat{A}^L\big)_{vu}$ decays exponentially in the number of high-degree bottlenecks on the $u$-$v$ paths.
Practical consequence. Nodes separated by a high-degree hub (common in scale-free social networks) have near-zero Jacobian - the GNN cannot propagate useful gradient signal between them regardless of depth.
A.5 Graph Isomorphism Network: Full Algorithm
Algorithm: GIN for Graph Classification
Input: Graph $G = (V, E)$ with node features $x_v$. Output: Graph embedding $h_G$ and prediction $\hat{y}$.
1. Initialize: $h_v^{(0)} = x_v$ for all $v \in V$
2. For $l = 1, \ldots, L$:
   - For each node $v$: $h_v^{(l)} = \text{MLP}^{(l)}\big((1 + \epsilon^{(l)})\, h_v^{(l-1)} + \sum_{u \in N(v)} h_u^{(l-1)}\big)$
   - Apply batch normalization to $h_v^{(l)}$
3. Readout: for each layer $l$, compute a layer-specific graph embedding: $h_G^{(l)} = \sum_{v \in V} h_v^{(l)}$
4. Concatenate: $h_G = \big[h_G^{(0)} \,\|\, h_G^{(1)} \,\|\, \cdots \,\|\, h_G^{(L)}\big]$
5. Apply MLP classifier: $\hat{y} = \text{MLP}_{\text{out}}(h_G)$
Why concatenate all layers? Each layer captures patterns at a different structural scale: layer 0 is raw node features; layer 1 is the immediate neighborhood; layer $l$ is the $l$-hop subgraph structure. Concatenating all layers allows the classifier to use patterns at all scales simultaneously, matching the JK-Net (Jumping Knowledge Networks) readout.
Choice of $\epsilon$. In practice, $\epsilon = 0$ works well - the MLP is expressive enough to learn the correct weighting. A learned $\epsilon$ adds one scalar parameter per layer and sometimes improves performance on sparse graphs where the self-feature has a very different scale from the aggregated neighborhood.
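A compact dense-adjacency sketch of the full algorithm (single graph, $\epsilon = 0$; batching and sparse aggregation omitted):

```python
import torch
import torch.nn as nn

class GINGraphClassifier(nn.Module):
    """Sum aggregation, 2-layer MLP updates, layer-wise sum readout, concat."""
    def __init__(self, d_in, d_hid, n_classes, L=3):
        super().__init__()
        dims = [d_in] + [d_hid] * L
        self.mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(dims[l], d_hid), nn.ReLU(),
                          nn.Linear(d_hid, d_hid), nn.BatchNorm1d(d_hid))
            for l in range(L))
        self.out = nn.Linear(d_in + L * d_hid, n_classes)

    def forward(self, A, X):                 # A: (n, n), X: (n, d_in)
        states, H = [X], X
        for mlp in self.mlps:
            H = mlp(H + A @ H)               # (1 + eps) h_v + sum over N(v), eps = 0
            states.append(H)
        hG = torch.cat([s.sum(dim=0) for s in states])   # layer-wise sum readout
        return self.out(hG)
```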
A.6 GPS Layer: Formal Specification
The GPS (General, Powerful, Scalable) Graph Transformer layer (Rampasek et al., 2022) is specified as follows.
Input: Node embeddings $X^{(l)} \in \mathbb{R}^{n \times d}$, adjacency $A$ (with edge features), positional encodings appended at layer 0.
MPNN sub-layer. $\hat{X}_M = \text{MPNN}^{(l)}\big(X^{(l)}, A\big)$. The MPNN can be any of: GCN, GINE, GAT, GATv2. GINE (GIN with Edge features) is:
$$h_v' = \text{MLP}\Big((1 + \epsilon)\, h_v + \sum_{u \in N(v)} \mathrm{ReLU}\big(h_u + e_{uv}\big)\Big)$$
Transformer sub-layer. Standard multi-head self-attention over all nodes: $\hat{X}_T = \text{MHSA}\big(X^{(l)}\big)$.
Feed-forward + LayerNorm:
$$X^{(l+1)} = \text{LN}\Big(\text{FFN}\big(\text{LN}\big(\hat{X}_M + \hat{X}_T + X^{(l)}\big)\big) + \hat{X}_M + \hat{X}_T + X^{(l)}\Big)$$
where FFN is a 2-layer MLP (the exact placement of residuals and norms varies by implementation).
Complexity: $O(m)$ for the MPNN sub-layer (sparse), $O(n^2)$ for the Transformer sub-layer (dense). Total: $O(m + n^2)$. For sparse graphs, where $m \ll n^2$, the Transformer dominates.
Practical approximation. For large graphs, replace the full Transformer with a linear attention approximation (Performer, Longformer-style local+global attention, or Nystromformer). This reduces the Transformer cost to $O(n)$, making GPS scalable to large graphs while retaining the long-range attention benefit.
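A slim sketch of one GPS layer, with a GCN-style propagation standing in for the MPNN branch and residual/norm placement simplified relative to the formal specification above:

```python
import torch
import torch.nn as nn

class GPSLayer(nn.Module):
    def __init__(self, d, heads=4):
        super().__init__()
        self.local = nn.Linear(d, d)                     # MPNN branch weights
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 2 * d), nn.ReLU(), nn.Linear(2 * d, d))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, A_hat, X):                          # A_hat: (n, n), X: (n, d)
        x_local = torch.relu(self.local(A_hat @ X))       # sparse local branch
        x_glob, _ = self.attn(X[None], X[None], X[None])  # dense global branch
        h = self.norm1(X + x_local + x_glob[0])           # residual + combine
        return self.norm2(h + self.ffn(h))                # feed-forward block
```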
Appendix B: Implementation Notes
B.1 GCN in Matrix Form: Step-by-Step
Given a graph with adjacency matrix $A$, node features $X$, and weight matrices $W^{(0)}$, $W^{(1)}$:
```
Step 1: A_tilde = A + I_n                               (add self-loops)
Step 2: D_tilde = diag(A_tilde 1)                       (diagonal of row sums)
Step 3: D_tilde^{-1/2}: take 1/sqrt of the diagonal entries
Step 4: A_hat = D_tilde^{-1/2} A_tilde D_tilde^{-1/2}   (sparse: O(m) entries)
Step 5: H^1 = ReLU(A_hat X W^0)                         shape: (n, d_1)
Step 6: H^2 = softmax(A_hat H^1 W^1)                    shape: (n, d_out)  [row-wise, for classification]
Step 7: Loss = -sum_{v in V_L} sum_c Y_{vc} log H^2_{vc}
Step 8: Backpropagate through H^2, H^1, W^1, W^0
```
Sparsity. $\hat{A}$ for real-world graphs is extremely sparse (average degree 10 out of $10^6$ nodes means $10^{-5}$ fill). All operations involving $\hat{A}$ should use sparse matrix formats (CSR, COO). The product $\hat{A} H$ for sparse $\hat{A}$ and dense $H$ costs $O(md)$, not $O(n^2 d)$.
B.2 Edge List Representation and Scatter Operations
In practice, GNNs are implemented using an edge list representation rather than adjacency matrices. The graph is stored as two arrays:
```
edge_index: shape (2, m)   - edge_index[0] = source nodes, edge_index[1] = target nodes
edge_attr:  shape (m, d_e) - edge features
x:          shape (n, d_v) - node features
```
The message-passing computation is implemented as:
```python
import torch

# x: (n, d_v) node features; edge_index: (2, m); edge_attr: (m, d_e)
# M is the (learned) message function; d_out is the message dimension.

# Gather source node features for all edges
messages = M(x[edge_index[0]], edge_attr)            # shape: (m, d_out)

# Scatter messages to target nodes (sum aggregation)
agg = torch.zeros(n, d_out, device=x.device)
agg.scatter_add_(0,
                 edge_index[1].unsqueeze(-1).expand(-1, d_out),
                 messages)                           # shape: (n, d_out)
```
This scatter_add operation (implemented in PyTorch Geometric and DGL) is the fundamental primitive of GPU-accelerated GNN training. It is equivalent to a sparse matrix-vector product but more flexible (supports arbitrary message functions).
B.3 Mini-Batch Neighbor Sampling Implementation
GraphSAGE Training Step:
========================
1. Sample batch of target nodes: B \subseteq V, |B| = B_size
2. For l = L, L-1, ..., 1:
- For each node v in current frontier:
- Sample S_l neighbors: N_l(v) \subseteq N(v), |N_l(v)| = S_l
- Current frontier = B \cup 1-hop(B) \cup ... \cup l-hop(B) [sampled]
3. Extract induced subgraph of all frontier nodes
4. Run GNN on induced subgraph (full-batch over small subgraph)
5. Compute loss only on target nodes B
6. Backpropagate
Total nodes per training step: $O\big(B \cdot \prod_l S_l\big)$. For example, with $B = 512$, $S_1 = 25$, $S_2 = 10$: about $512 \times (1 + 25 + 250) \approx 1.4 \times 10^5$ nodes per step.
B.4 Practical GNN Hyperparameter Guide
| Hyperparameter | Typical Range | Notes |
|---|---|---|
| Number of layers | 2-4 | Start with 2; increase only if task requires long-range |
| Hidden dimension | 64-512 | Match to dataset size; 256 is a strong default |
| Dropout rate | 0.0-0.5 | 0.3-0.5 on feature/edge dropout helps regularize |
| Aggregation | Sum (GIN), Mean (GCN) | Sum for classification; mean for regression |
| Activation | ReLU, PReLU | PReLU slightly better on sparse graphs |
| Batch normalization | After each layer | Essential for deep GNNs (4+ layers) |
| Learning rate | 0.001-0.01 | Adam optimizer; cosine decay schedule |
| Neighbor sample sizes | $[S_1, S_2] = [25, 10]$ or $[15, 10, 5]$ | Larger improves accuracy; diminishing returns |
| Weight decay | $10^{-5}$ - $5 \times 10^{-4}$ | L2 regularization on weight matrices |
| Number of attention heads | 4-8 | For GAT; each head uses $d/H$ features |
Appendix C: Connections to Classical Machine Learning
C.1 GNNs and Kernel Methods
The GNN's computation can be viewed through the lens of kernel methods. Define the graph neural tangent kernel (GNTK) as the limiting kernel of an infinitely-wide GNN at initialization:
$$\Theta(G, G') = \lim_{\text{width} \to \infty} \big\langle \nabla_\theta f_\theta(G),\; \nabla_\theta f_\theta(G') \big\rangle$$
This connects GNNs to the NTK theory of Jacot et al. (2018) and provides a framework for analyzing GNN training dynamics. For graph classification:
- Wide GNN = kernel SVM: at infinite width, a GNN with random feature initialization trained by gradient descent is equivalent to a kernel machine with the GNTK
- GNTK expressiveness: the GNTK of a GIN-like architecture is exactly as expressive as 1-WL (matching the finite-depth GNN result)
Graph Weisfeiler-Leman kernel (Shervashidze et al., 2011). Before GNNs, graph kernels were the primary ML method for graphs. The WL kernel computes a feature vector from the color histograms at each WL refinement step:
$$K_{\text{WL}}(G, G') = \sum_{t=0}^{T} \big\langle \phi_t(G),\; \phi_t(G') \big\rangle$$
where $\phi_t(G)$ is the histogram of colors at WL iteration $t$. The WL kernel is equivalent to a GIN with fixed (non-learned) injective updates - it is the non-parametric version of GIN.
For AI: Understanding GNNs as kernel machines allows the use of kernel theory (generalization bounds, bandwidth selection, representer theorem) to analyze GNN behavior - important for the theoretical foundations in Chapter 12 (Functional Analysis) and Chapter 21 (Statistical Learning Theory).
C.2 GNNs and Belief Propagation
Loopy belief propagation (LBP) on probabilistic graphical models (Markov random fields, factor graphs) has exactly the same computational structure as MPNN message passing.
Belief propagation update:
$$m_{u \to v}(x_v) = \sum_{x_u} \psi_{uv}(x_u, x_v)\, \phi_u(x_u) \prod_{w \in N(u) \setminus \{v\}} m_{w \to u}(x_u)$$
MPNN message update:
$$m_v^{(l+1)} = \sum_{u \in N(v)} M^{(l)}\big(h_v^{(l)}, h_u^{(l)}, e_{uv}\big)$$
The analogy:
- BP messages $m_{u \to v}$ <-> GNN messages $m_v$
- BP pairwise potential $\psi_{uv}$ <-> GNN message function $M$
- BP product of incoming messages <-> GNN sum/product aggregation
Implication: GNNs can be viewed as learned approximations to belief propagation. Where BP computes exact marginals on trees (and approximate marginals on loopy graphs), GNNs learn to compute useful representations that may or may not correspond to probabilistic marginals. For tasks that arise from probabilistic inference on graphs (decoding, error correction, satisfiability), architectures inspired by BP (like LDPC decoders implemented as GNNs) achieve state-of-the-art results.
C.3 Label Propagation vs GCN
Label propagation (Zhu et al., 2003) is a classical semi-supervised learning method that also propagates information across graph edges. The update rule:
$$F^{(t+1)} = \alpha\, S F^{(t)} + (1 - \alpha)\, Y$$
where $S = D^{-1/2} A D^{-1/2}$, $F$ is the matrix of soft label predictions, $Y$ holds the observed labels (zero rows for unlabeled nodes), and $\alpha \in (0, 1)$ controls the balance between propagation and fitting the observed labels.
At convergence: $F^* = (1 - \alpha)\big(I - \alpha S\big)^{-1} Y$ - a closed-form solution.
GCN vs label propagation:
- Label propagation: linear, closed-form, uses only graph structure (no node features)
- GCN: nonlinear, gradient-based, uses both node features and graph structure
- For feature-rich graphs: GCN consistently outperforms LP
- For feature-poor graphs (only node IDs): LP is competitive and much cheaper
- APPNP (Gasteiger et al., 2019): decouples the GNN feature transformation from propagation, applying $K$ personalized-PageRank steps $Z^{(k+1)} = (1 - \alpha)\, \hat{A} Z^{(k)} + \alpha H$ as a post-processing operation similar to LP. Achieves better performance than GCN on many node classification benchmarks by using more propagation steps without over-smoothing (the teleport coefficient $\alpha$ keeps mixing the original predictions back in); see the sketch below.
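A sketch of the APPNP-style propagation applied to feature-only predictions `H` (e.g., MLP logits):

```python
import torch

def appnp(A_hat, H, K=10, alpha=0.1):
    """A_hat: (n, n) normalized adjacency; H: (n, c) feature-only predictions."""
    Z = H
    for _ in range(K):
        Z = (1 - alpha) * (A_hat @ Z) + alpha * H   # propagate, teleport back to H
    return Z
```

Setting `alpha = 0` recovers pure smoothing (and with it, over-smoothing); the teleport term is what keeps the original predictions alive at any depth.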
C.4 Random Walk Methods and GNNs
Before GNNs, random walk methods (DeepWalk, Node2Vec) were the standard for learning node embeddings.
DeepWalk (Perozzi et al., 2014). Sample random walks from each node; treat the walk as a "sentence" and apply Skip-Gram (Word2Vec) to learn node embeddings that predict context nodes in the walk.
Node2Vec (Grover & Leskovec, 2016). Biased random walks with parameters (return probability) and (in-out probability) to balance BFS-like (homophily) and DFS-like (structural equivalence) exploration.
Connection to GNNs. The Skip-Gram objective of DeepWalk is (approximately) equivalent to factorizing a shifted-PMI version of the matrix:
$$M = \frac{1}{T} \sum_{t=1}^{T} P^t, \qquad P = D^{-1} A,$$
where $P^t$ is the $t$-step transition matrix. This is closely related to the graph's random walk kernel.
GNNs can incorporate random walk information through RWSE (7.6) or through the unsupervised pretraining objective of GraphSAGE (5.4). Modern practice often combines: pretrain with random walk objectives (fast, scalable), then fine-tune with GNN layers (expressive, task-specific).
C.5 Over-smoothing and Spectral Perspective
From the spectral perspective of 11-04, over-smoothing has a clean frequency interpretation. A graph signal $x$ can be decomposed into its graph Fourier components $x = \sum_i \hat{x}_i u_i$ (where $u_i$ is the $i$-th Laplacian eigenvector). The GCN propagation acts as a low-pass filter:
$$\hat{A} x = \sum_i (1 - \lambda_i)\, \hat{x}_i u_i$$
where $\lambda_i$ are the eigenvalues of $\tilde{L}$. After $k$ layers:
- Lowest frequency ($\lambda_1 = 0$): the constant component, multiplied by factor $1$ - perfectly preserved
- Low frequencies ($\lambda_i$ near 0): slowly varying signals, multiplied by $(1 - \lambda_i)^k \approx 1$ - nearly preserved
- High frequencies ($\lambda_i$ near or above 1): rapidly oscillating signals, multiplied by $(1 - \lambda_i)^k \approx 0$ - exponentially suppressed
The class label signal is typically a low-frequency signal (connected nodes tend to have the same class in homophilic graphs). But even the label signal is eventually suppressed by many propagation steps - after enough layers, only the zero-frequency constant component survives and all nodes collapse.
Optimal depth. The optimal number of layers balances two effects:
- More layers: larger receptive field, more node context, better long-range information
- More layers: stronger smoothing, loss of high-frequency discriminative features
For a graph with Fiedler value $\lambda_2$, the optimal depth is approximately $O(1/\lambda_2)$ (the mixing time of the random walk). For clustered graphs with small $\lambda_2$ (e.g., $\lambda_2 \approx 0.05$): tens of layers could be used before full mixing. For expander-like graphs with large $\lambda_2$ (e.g., $\lambda_2 \approx 0.5$): 2-3 layers is optimal.
Appendix D: GNN Training Recipes and Engineering Tricks
D.1 Data Preprocessing for Graph ML
Node feature normalization. Unlike image pixels (naturally in $[0, 255]$ or $[0, 1]$), node features in graphs span wildly different scales. Atom features in molecular graphs: atomic number (1-118), formal charge (roughly $-2$ to $+2$), number of hydrogens (0-4). Normalizing to zero mean, unit variance per feature dimension is essential.
Edge feature preprocessing. Bond lengths in molecules span a narrow band, roughly 1.2 Å (triple bond) to 1.5 Å (single bond). If used as raw edge features, their small absolute range means they have little influence relative to bond-type one-hot features. Apply normalization or use learned embeddings.
Graph-level normalization. For graph regression tasks (predicting molecular energy, which scales with the number of atoms), normalize the target by dividing by the atom count $|V|$. Otherwise, larger molecules trivially have larger predicted values.
Degree-based normalization. When using sum aggregation, node representations can grow with degree. Apply degree normalization as part of preprocessing: scale node features by $1/\sqrt{d_v}$ before input, matching the GCN normalization.
D.2 Positional Encodings: Practical Considerations
LapPE sign ambiguity. Eigenvectors are defined only up to sign: if $u_i$ is an eigenvector of $L$, so is $-u_i$. Naive use of eigenvectors as PEs means two identical graphs with differently-signed PEs produce different GNN outputs - making the model sensitive to an arbitrary sign choice.
Fix (Lim et al., 2022): during training, randomly flip the sign of each eigenvector independently at each mini-batch. This forces the GNN to be invariant to the sign choice. At inference, use a consistent sign convention (e.g., always choose the sign such that the first non-zero element is positive).
RWSE efficiency. Computing $p_\ell(v)$ for $\ell = 1, \ldots, k$ requires $k$ sparse matrix powers - expensive for large graphs. Efficient alternative: start random walks from each node $v$ and estimate the return probability by averaging over $W$ random walkers:
$$\hat{p}_\ell(v) = \frac{1}{W} \sum_{w=1}^{W} \mathbb{1}\big[\text{walker } w \text{ starting at } v \text{ is back at } v \text{ after } \ell \text{ steps}\big]$$
With $W$ walkers per node, this gives accurate RWSE estimates in $O(nWk)$ time.
Eigenvalue degeneracy. If two eigenvalues of $L$ are equal ($\lambda_i = \lambda_{i+1}$), the corresponding eigenspace is (at least) two-dimensional and any orthonormal basis for it is valid. This means LapPE is non-unique for graphs with symmetry (e.g., complete graphs, regular graphs). Degenerate eigenvectors carry basis-dependent rather than structural information - they should either be excluded from PEs or handled with random rotation augmentation.
D.3 Loss Functions for Graph Tasks
Node classification. Cross-entropy over labeled nodes:
$$L = -\sum_{v \in V_L} \sum_{c} Y_{vc} \log \hat{Y}_{vc}$$
For class-imbalanced graphs (e.g., rare fraud nodes), use focal loss or weighted cross-entropy.
Link prediction. Binary cross-entropy with negative sampling:
$$L = -\sum_{(u,v) \in E} \log \sigma\big(z_u^\top z_v\big) \;-\; \sum_{(u,v) \in E^-} \log\big(1 - \sigma(z_u^\top z_v)\big)$$
where $E^-$ is a set of sampled non-edges. Negative sampling ratio (how many non-edges per positive edge): 1:1 to 1:5 is standard. Hard negative mining (sampling non-edges that are structurally close to positive edges) improves embedding quality at the cost of more complex sampling.
Graph regression. Mean absolute error (MAE) for quantum chemistry (physically meaningful, outlier-robust) or mean squared error (MSE) for property prediction where large errors are particularly bad. Normalize targets per-dataset (e.g., $z$-score them) for training; inverse-transform predictions at evaluation.
Contrastive learning on graphs. Self-supervised GNNs (GraphCL, SimGRACE) maximize agreement between augmented views of the same graph:
$$L_i = -\log \frac{\exp\big(\mathrm{sim}(z_i, z_i') / \tau\big)}{\sum_{j} \exp\big(\mathrm{sim}(z_i, z_j') / \tau\big)}$$
where $\mathrm{sim}$ is cosine similarity, $\tau$ is a temperature, and $z_i'$ is the embedding of an augmented view (e.g., edge dropout, node feature masking). Graph augmentations must preserve semantic meaning - dropping a bond in a molecule changes its identity, so augmentation must be domain-aware.
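A sketch of the InfoNCE-style objective over a batch of graph pairs (positive pairs sit on the diagonal of the similarity matrix; `tau` is the temperature):

```python
import torch
import torch.nn.functional as F

def graph_infonce(z1, z2, tau=0.5):
    """z1, z2: (B, d) embeddings of two augmented views of the same B graphs."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.t() / tau                    # (B, B) cosine similarities
    targets = torch.arange(len(z1))            # positives on the diagonal
    return F.cross_entropy(sim, targets)
```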
D.4 Debugging GNNs in Practice
Symptom: GNN performs worse than an MLP on node features alone. Cause: the graph structure is not informative for the task (e.g., low homophily), or the GNN is over-smoothing and destroying the discriminative node features. Fix: measure the edge homophily ratio $h$ (the fraction of edges connecting same-class nodes); if $h < 0.5$, graph structure can actively hurt. Use heterophilic GNNs (FAGCN, H2GCN) that can handle graphs where connected nodes have different labels.
Symptom: Training loss decreases but validation loss explodes. Cause: overfitting, likely due to GNN memorizing the training graph topology rather than learning transferable patterns. Fix: add dropout on node features (not just edges), reduce model capacity, add weight decay. For transductive semi-supervised learning, ensure the train/val split is performed correctly.
Symptom: All node representations collapse to the same vector. Cause: over-smoothing (too many layers) or numerical instability (exploding/vanishing gradients through repeated propagation). Fix: check the Dirichlet energy after each layer; use PairNorm or GCNII; add gradient clipping.
Symptom: GAT attention weights are all uniform ($\alpha_{vu} \approx 1/|N(v)|$). Cause: the attention vector $a$ has collapsed - the model learned that uniform attention is safest. Often occurs with high learning rates or without proper initialization. Fix: initialize $a$ with Xavier initialization; use a smaller learning rate for the attention parameters; check for gradient vanishing in LeakyReLU (use a larger negative slope, e.g., 0.2).
Symptom: Training is extremely slow. Cause: running GNN on the full large graph every step (full-batch training). Fix: switch to mini-batch training with neighbor sampling (GraphSAGE), graph partitioning (Cluster-GCN), or subgraph sampling (GraphSAINT). Profile GPU utilization - if below 70%, the bottleneck is data loading, not computation.
Appendix E: Benchmark Datasets and Evaluation
E.1 Standard Node Classification Benchmarks
| Dataset | Nodes | Edges | Features | Classes | Homophily | Task |
|---|---|---|---|---|---|---|
| Cora | 2,708 | 5,429 | 1,433 | 7 | 0.81 | Transductive node clf |
| CiteSeer | 3,327 | 4,732 | 3,703 | 6 | 0.74 | Transductive node clf |
| PubMed | 19,717 | 44,338 | 500 | 3 | 0.80 | Transductive node clf |
| Amazon-Photo | 7,650 | 119,081 | 745 | 8 | 0.83 | Transductive node clf |
| Chameleon | 2,277 | 36,101 | 2,325 | 5 | 0.23 | Heterophilic node clf |
| Squirrel | 5,201 | 217,073 | 2,089 | 5 | 0.22 | Heterophilic node clf |
| OGB-arxiv | 169,343 | 1,166,243 | 128 | 40 | 0.65 | Large-scale node clf |
| OGB-products | 2,449,029 | 61,859,140 | 100 | 47 | 0.81 | Large-scale node clf |
Evaluation protocol. Standard split for Cora/CiteSeer/PubMed: 20 nodes per class for training, 500 for validation, 1000 for test (following Kipf & Welling, 2017). OGB uses official dataset-specific splits with multiple random seeds for reliable comparison.
E.2 Standard Graph Classification Benchmarks
| Dataset | Graphs | Avg. nodes | Avg. edges | Classes | Domain |
|---|---|---|---|---|---|
| MUTAG | 188 | 17.9 | 19.8 | 2 | Molecular (mutagenicity) |
| PROTEINS | 1,113 | 39.1 | 72.8 | 2 | Protein function |
| IMDB-B | 1,000 | 19.8 | 96.5 | 2 | Social (movie collaboration) |
| COLLAB | 5,000 | 74.5 | 2,457.8 | 3 | Social (research collaboration) |
| OGB-molhiv | 41,127 | 25.5 | 27.5 | 2 | Molecular (HIV inhibition) |
| OGB-molpcba | 437,929 | 26.0 | 28.1 | 128 | Molecular (bioassays) |
| QM9 | 130,831 | 18.0 | 18.6 | - | Molecular (12 quantum properties) |
Evaluation. Graph classification: 10-fold cross-validation accuracy for TU datasets; ROC-AUC for OGB. Molecular regression (QM9): mean absolute error (MAE) per property, target unit in Angstroms, eV, Debye (domain-dependent).
E.3 Long-Range Graph Benchmarks (LRGB)
Introduced by Dwivedi et al. (2022) to specifically test long-range dependency learning - tasks where local GNNs fail but graph transformers succeed.
| Dataset | Nodes | Task | Long-range property |
|---|---|---|---|
| Peptides-func | 15.5K graphs, avg 150 nodes | Graph multi-label clf | Functional class depends on full peptide structure |
| Peptides-struct | 15.5K graphs, avg 150 nodes | Graph regression | 3D structural properties (end-to-end distances) |
| PCQM-Contact | 529K graphs | Link prediction | Molecular contact maps (spatial proximity) |
| COCO-SP | 123K graphs, avg 476 nodes | Node segmentation | Pixel graphs; semantic context is long-range |
GNN vs Graph Transformer gap. On Peptides-func (AP metric):
- GCN: 0.5930
- GIN: 0.6621
- GPS (GINE + Transformer): 0.7470
- GPS + RWSE: 0.7538
The gap between GIN and GPS demonstrates the clear advantage of global attention for long-range tasks.
E.4 Choosing the Right Architecture
ARCHITECTURE SELECTION GUIDE
========================================================================
Task requires long-range dependencies?
+-- YES -> Graph Transformer (GPS, Graphormer)
|          or graph rewiring + MPNN
+-- NO  -> continue below
Graph has millions of nodes?
+-- YES -> GraphSAGE (inductive) or Cluster-GCN/GraphSAINT (large)
+-- NO  -> continue below
Need to classify entire graphs?
+-- YES -> GIN + sum pooling (expressiveness matters)
|          or DiffPool/SAGPool for hierarchical structure
+-- NO  -> continue below
Need interpretable edge importance?
+-- YES -> GAT or GATv2
+-- NO  -> continue below
Simple strong baseline needed first?
+-- YES -> 2-layer GCN (fast to implement, often competitive)
Has rich edge features (bond types, relation types)?
+-- YES -> MPNN/GINE with edge message functions
========================================================================
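The same guide, condensed into a small helper function for quick lookup (the function and flag names are ours, purely illustrative):

```python
def suggest_architecture(long_range=False, millions_of_nodes=False,
                         graph_level=False, need_edge_importance=False,
                         rich_edge_features=False):
    """Mirror of the selection guide above; illustrative, not exhaustive."""
    if long_range:
        return "Graph Transformer (GPS, Graphormer) or graph rewiring + MPNN"
    if millions_of_nodes:
        return "GraphSAGE (inductive) or Cluster-GCN / GraphSAINT"
    if graph_level:
        return "GIN + sum pooling (or DiffPool/SAGPool for hierarchy)"
    if need_edge_importance:
        return "GAT or GATv2"
    if rich_edge_features:
        return "MPNN/GINE with edge message functions"
    return "2-layer GCN (fast to implement, often competitive baseline)"

print(suggest_architecture(graph_level=True))  # GIN + sum pooling ...
```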
Appendix F: Notation Summary for This Section
| Symbol | Definition | First Used |
|---|---|---|
| $G = (V, E)$ | Undirected graph with vertex set $V$, edge set $E$ | 2.1 |
| $n$, $m$ | Number of nodes and edges | 2.1 |
| $X \in \mathbb{R}^{n \times d}$ | Node feature matrix; row $v$ is $x_v^\top$ | 2.1 |
| $\mathcal{N}(v)$ | Neighborhood of node $v$: $\{u \in V : (u, v) \in E\}$ | 1.2 |
| $d_v$ | Degree of node $v$ | 2.1 |
| $h_v^{(\ell)}$ | Representation of node $v$ at layer $\ell$ | 3.1 |
| $H^{(\ell)}$ | All node representations at layer $\ell$ | 4.2 |
| $A$ | Adjacency matrix | 2.1 |
| $\tilde{A} = A + I$ | Adjacency with self-loops | 4.2 |
| $D$ | Degree matrix: $D_{vv} = d_v$ | 4.2 |
| $\tilde{D}$ | Degree matrix of $\tilde{A}$ | 4.2 |
| $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ | GCN propagation matrix | 4.2 |
| $L = D - A$ | Unnormalized Laplacian | 8.2 |
| $L_{\mathrm{sym}} = I - D^{-1/2} A D^{-1/2}$ | Normalized (symmetric) Laplacian | 10.2 |
| $E(H^{(\ell)})$ | Dirichlet energy of $H^{(\ell)}$ | 8.2 |
| $m_v^{(\ell)}$ | Aggregated message at node $v$, layer $\ell$ | 3.1 |
| $M^{(\ell)}$ | Message function at layer $\ell$ | 3.1 |
| $U^{(\ell)}$ | Update function at layer $\ell$ | 3.1 |
| $R$ | Readout function for graph-level tasks | 3.1 |
| $W^{(\ell)}$ | Weight matrix at layer $\ell$ | 4.2 |
| $\alpha_{ij}$ | GAT attention coefficient from $j$ to $i$ | 6.2 |
| $c_v^{(t)}$ | WL color of node $v$ at iteration $t$ | 7.2 |
| $\epsilon$ | GIN learnable mixing parameter | 7.4 |
| $\mathrm{MAD}$ | Mean Average Distance of node representations | 8.1 |
| $S^{(\ell)}$ | DiffPool soft cluster assignment matrix | 9.3 |
| $p_v$ | Positional encoding of node $v$ | 10.2 |
| $U_k$ | First $k$ eigenvectors of $L$ | 10.2 |
| $d_{\mathrm{SP}}(i, j)$ | Shortest-path distance between $i$ and $j$ | 10.4 |
| $b_{d_{\mathrm{SP}}(i,j)}$ | Graphormer spatial encoding bias at distance $d_{\mathrm{SP}}(i,j)$ | 10.4 |
| $P$ | $n \times n$ permutation matrix | 2.3 |
| $\pi$ | Stationary distribution of the random walk on $G$ | 8.1 |
| $\Theta$ | Graph neural tangent kernel | C.1 |
| $k_{\mathrm{WL}}$ | Weisfeiler-Leman graph kernel | C.1 |
All vectors are column vectors by default ($x \in \mathbb{R}^d$ means $x$ is $d \times 1$). Matrix norms: $\|\cdot\|_F$ (Frobenius), $\|\cdot\|_2$ (spectral). Notation follows docs/NOTATION_GUIDE.md throughout.
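To tie the GCN symbols together, a minimal NumPy sketch computing $\tilde{A}$, $\tilde{D}$, and $\hat{A}$ for a toy 4-node path graph (array names mirror the table above):

```python
# Compute the GCN propagation matrix A_hat for a 4-node path graph.
import numpy as np

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

A_tilde = A + np.eye(4)                       # A~ = A + I (self-loops)
D_tilde = np.diag(A_tilde.sum(axis=1))        # D~: degree matrix of A~
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D_tilde)))
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # A^ = D~^{-1/2} A~ D~^{-1/2}

# One propagation step (no weights or nonlinearity): H' = A_hat @ H
H = np.random.randn(4, 8)
H_next = A_hat @ H
print(A_hat.round(3))
```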
Appendix G: Further Reading
Foundational Papers (Must-Read)
- Scarselli et al. (2009) - "The Graph Neural Network Model." IEEE Transactions on Neural Networks. Original GNN formulation with fixed-point iteration. Historical foundation.
- Defferrard, Bresson & Vandergheynst (2016) - "Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering." NeurIPS 2016. ChebNet: polynomial spectral filters, making spectral GNNs computationally feasible.
- Kipf & Welling (2017) - "Semi-Supervised Classification with Graph Convolutional Networks." ICLR 2017. GCN: the simplification that started the field. Short, clear, highly cited.
- Hamilton, Ying & Leskovec (2017) - "Inductive Representation Learning on Large Graphs." NeurIPS 2017. GraphSAGE: inductive learning, neighbor sampling, unsupervised objectives.
- Gilmer et al. (2017) - "Neural Message Passing for Quantum Chemistry." ICML 2017. MPNN framework: the unified abstraction. The formalism most GNN papers use today.
- Velickovic et al. (2018) - "Graph Attention Networks." ICLR 2018. GAT: learned attention weights for graphs. Widely deployed in production.
- Xu et al. (2019) - "How Powerful are Graph Neural Networks?" ICLR 2019. GIN + the WL expressiveness theorem: the theoretical foundation of GNN expressiveness.
Depth Pathologies
- Li et al. (2018) - "Deeper Insights into Graph Convolutional Networks for Semi-Supervised Classification." AAAI 2018. First formal analysis of over-smoothing.
- Alon & Yahav (2021) - "On the Bottleneck of Graph Neural Networks and its Practical Implications." ICLR 2021. Over-squashing: Jacobian analysis and graph rewiring motivation.
- Chen et al. (2020) - "Simple and Deep Graph Convolutional Networks." ICML 2020. GCNII: initial residual + identity mapping for training 64-layer GCNs.
Graph Transformers
- Ying et al. (2021) - "Do Transformers Really Perform Bad for Graph Representation?" NeurIPS 2021. Graphormer: spatial, centrality, and edge encodings for graph transformers. OGB-LSC winner.
- Rampasek et al. (2022) - "Recipe for a General, Powerful, Scalable Graph Transformer." NeurIPS 2022. GPS framework: MPNN + Transformer combination with arbitrary positional encodings.
- Dwivedi et al. (2022) - "Long Range Graph Benchmark." NeurIPS 2022 Datasets and Benchmarks. LRGB: benchmarks specifically designed to test long-range dependency learning.
Expressiveness
- Morris et al. (2019) - "Weisfeiler and Leman Go Neural: Higher-Order Graph Neural Networks." AAAI 2019. $k$-GNN: implementing $k$-WL as a neural network.
- Brody, Alon & Yahav (2022) - "How Attentive are Graph Attention Networks?" ICLR 2022. GATv2: fixing the static attention problem in GAT.
Applications
- Jumper et al. (2021) - "Highly Accurate Protein Structure Prediction with AlphaFold." Nature. AlphaFold 2: the Evoformer and structure module. GNNs at scale for biology.
- Ying et al. (2018) - "Graph Convolutional Neural Networks for Web-Scale Recommender Systems." KDD 2018. PinSage: GraphSAGE at 3 billion nodes. Engineering lessons for production GNNs.
- Edge et al. (2024) - "From Local to Global: A Graph RAG Approach to Query-Focused Summarization." Microsoft Research. Graph RAG: GNNs + LLMs for document understanding.
Textbooks and Surveys
- Hamilton (2020) - Graph Representation Learning. Synthesis Lectures on AI. The definitive textbook on GNNs; covers theory and practice comprehensively.
- Wu et al. (2022) - "A Comprehensive Study on Large-Scale Graph Training." NeurIPS 2022. Empirical comparison of full-batch, neighbor sampling, and subgraph sampling methods.