Published March 2026
Guide · Graph ML

Graph neural networks: when and why.

Not everything is a graph problem, and not every graph problem needs a GNN. This guide helps you decide when graph structure genuinely helps — and which architecture to reach for when it does.

Five architectures, three OGB benchmarks, four code examples. PyTorch Geometric throughout.

§ 01 · Fit

When you need a GNN, and when you don't.

Two checklists, then a four-question flowchart that usually resolves it.

GNNs shine when…
  • Structure is informative: a molecule's 3D connectivity determines its properties.
  • Topology varies: different samples have different graph structures.
  • Relational reasoning matters: predicting interactions, influence, links.
  • Inductive generalisation: unseen graphs at inference time.
  • Sparse connections: each node connects to few neighbors relative to graph size.
Skip GNNs when…
  • Data is tabular: XGBoost still beats GNNs on feature-rich tabular data.
  • Graph is fully connected: you have reinvented attention — use a Transformer.
  • Edges are artificial: constructing a KNN graph rarely beats an MLP.
  • Sequential structure dominates: sequence models capture the inductive bias better.
  • Scale is tiny: under 100 nodes, classical methods often suffice.
The decision flowchart
1. Is your data naturally a graph? No → tabular/sequence first. Yes → continue.
2. Does topology vary across samples? No, single fixed graph → spectral or label propagation. Yes → GNN.
3. Generalise to unseen graphs? No → GCN or spectral. Yes → GraphSAGE, GIN, or GPS.
4. Long-range dependencies critical? No → GCN/GAT (2–3 layers). Yes → GPS or deep GNN with skips.
§ 02 · Architectures

Five architectures, simply explained.

GCN · Graph Convolutional Network
Kipf & Welling, 2017

One-liner: Each node averages its neighbors' features (weighted by degree), then applies a linear transform.

Intuition: Think of it as a CNN where the “kernel” slides over a node's neighborhood instead of a pixel grid. Symmetric normalisation prevents high-degree nodes from dominating.

+ Simple, fast, well-understood
+ Strong baseline for node classification
− All neighbors weighted equally
− Transductive (fixed graph at train time)
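
In symbols, with \tilde{A} = A + I (adjacency plus self-loops) and \tilde{D} its degree matrix, one GCN layer is:

H^{(l+1)} = \sigma\left( \tilde{D}^{-1/2} \, \tilde{A} \, \tilde{D}^{-1/2} \, H^{(l)} W^{(l)} \right)

The \tilde{D}^{-1/2} factors are the symmetric normalisation mentioned above: each message is divided by the square-rooted degrees of both endpoints.
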
GAT · Graph Attention Network
Veličković et al., 2018

One-liner: Like GCN, but learns attention weights for each edge — some neighbors matter more than others.

Intuition: In a citation network, not all papers that cite yours are equally relevant. GAT learns a small neural network that scores each edge, softmax-normalised. Multi-head attention stabilises training.

+ Adaptive, interpretable attention
+ Works on heterogeneous graphs
− More memory than GCN
− Original has expressiveness issues (fixed by GATv2)
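
The learned edge score and its normalisation, for a single attention head:

e_{ij} = \mathrm{LeakyReLU}\left( \mathbf{a}^{\top} [\, W h_i \,\|\, W h_j \,] \right), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})}, \qquad
h_i' = \sigma\Big( \sum_{j \in \mathcal{N}(i)} \alpha_{ij} \, W h_j \Big)

where \| is concatenation and \mathbf{a} is the small scoring network mentioned in the intuition.
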
GraphSAGE · Sample and Aggregate
Hamilton et al., 2017

One-liner: Samples a fixed-size neighborhood, aggregates features, concatenates with the node's own embedding.

Intuition: The key insight is sampling. Instead of using the full neighborhood (impossible for billion-node graphs), sample K neighbors at each layer. This enables mini-batch training and inductive learning — the model generalises to unseen nodes and graphs.

+ Scales to millions of nodes
+ Inductive (unseen graphs)
+ Mini-batch training
− Sampling introduces variance
− Aggregator choice matters
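
One SAGE layer in symbols, shown with the mean aggregator and a sampled neighborhood \mathcal{N}_S(v):

h_v' = \sigma\left( W \cdot \big[\, h_v \,\|\, \mathrm{mean}\{ h_u : u \in \mathcal{N}_S(v) \} \,\big] \right)

Concatenating h_v with the aggregate (rather than mixing them) keeps the node's own signal separate from its neighborhood's.
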
GIN · Graph Isomorphism Network
Xu et al., 2019

One-liner: The most expressive message-passing GNN — provably as powerful as the 1-WL graph isomorphism test.

Intuition: GCN and GraphSAGE use mean/max aggregation, which cannot distinguish certain graph structures. GIN uses sum aggregation with an MLP, preserving multiset information. Different local structures map to different embeddings — the go-to choice for graph classification.

+ Maximally expressive within MPNN
+ Best for graph-level tasks
− Limited by 1-WL expressiveness ceiling
− Sum aggregation can be unstable on large neighborhoods
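
The GIN update in symbols:

h_v^{(k)} = \mathrm{MLP}^{(k)}\Big( (1 + \epsilon^{(k)}) \, h_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)} \Big)

The sum (rather than mean or max) preserves the multiset of neighbor features, which is where the 1-WL expressiveness comes from.
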
GPS · General, Powerful, Scalable Graph Transformer
Rampášek et al., 2022

One-liner: Combines local message passing (MPNN) with global self-attention in each layer, plus positional/structural encodings.

Intuition: Standard GNNs suffer from over-squashing: information from distant nodes gets bottlenecked. GPS fixes this by adding a Transformer-style global attention path alongside the local GNN path. Positional encodings (Laplacian eigenvectors, random walk statistics) give the model topology awareness.

+ Long-range dependencies
+ SOTA on many OGB benchmarks
+ Flexible — any MPNN + any attention
− O(n²) attention for global component
− More hyperparameters to tune
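
A minimal sketch of a GPS-style stack using PyTorch Geometric's GPSConv (available in recent PyG releases). The hidden size, depth, and the choice of GIN as the local message-passing component are illustrative, not the reference configuration; positional encodings would be concatenated onto the input features as discussed in § 07.

import torch
from torch_geometric.nn import GPSConv, GINConv, global_add_pool

class GPSModel(torch.nn.Module):
    def __init__(self, in_channels, hidden, out_channels, num_layers=4, heads=4):
        super().__init__()
        self.lin_in = torch.nn.Linear(in_channels, hidden)
        self.layers = torch.nn.ModuleList()
        for _ in range(num_layers):
            mlp = torch.nn.Sequential(
                torch.nn.Linear(hidden, hidden),
                torch.nn.ReLU(),
                torch.nn.Linear(hidden, hidden),
            )
            # Each GPS layer runs local message passing (GIN) and global
            # self-attention in parallel, then combines the two paths.
            self.layers.append(GPSConv(hidden, GINConv(mlp), heads=heads))
        self.classifier = torch.nn.Linear(hidden, out_channels)

    def forward(self, x, edge_index, batch):
        x = self.lin_in(x)
        for layer in self.layers:
            x = layer(x, edge_index, batch)
        return self.classifier(global_add_pool(x, batch))
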
§ 03 · Benchmarks

Open Graph Benchmark, 2026.

Numbers from the OGB leaderboards. Realistic, reproducible — not cherry-picked.

ogbg-molpcba

Graph classification: 128 bioassay labels, ~438K molecules. Metric: AP.

Model               | Type              | Test AP | Params
GPS + virtual node  | Graph Transformer | 0.3212  | ~6.2M
GIN + virtual node  | MPNN              | 0.2921  | ~3.4M
GCN + virtual node  | MPNN              | 0.2724  | ~2.0M
GIN (no VN)         | MPNN              | 0.2703  | ~1.9M
GCN (no VN)         | MPNN              | 0.2424  | ~1.5M
Virtual nodes add a global node connected to all others — a simple way to enable long-range message passing without full attention.
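
PyTorch Geometric exposes this as a data transform; a minimal sketch, assuming `data` is a standard Data object like the ones in the code section below:

import torch_geometric.transforms as T

# VirtualNode appends one extra node per graph and connects it to every other
# node in both directions, so any two nodes are at most two hops apart.
data_with_vn = T.VirtualNode()(data)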

ogbn-arxiv

Node classification: subject area of ~170K arXiv CS papers. Metric: Accuracy.

Model                   | Type              | Test acc. | Layers
GIANT-XRT + RevGAT      | GAT + LM features | 77.36%    | 3
GAT + label propagation | GAT + post-proc   | 74.15%    | 3
GCN                     | MPNN              | 71.74%    | 3
GraphSAGE               | MPNN (inductive)  | 71.49%    | 3
MLP (no graph)          | Baseline          | 55.50%    | 3
The MLP baseline shows how much graph structure matters here: +16 points from adding edges.

ogbl-collab

Link prediction: future collaborations between ~235K authors. Metric: Hits@50.

Model                     | Type           | Hits@50
S2GAE + SEAL              | GNN + subgraph | 66.79%
BUDDY                     | GNN + hashing  | 65.94%
GraphSAGE + edge features | MPNN           | 54.63%
Common Neighbors          | Heuristic      | 44.75%
Link prediction is where GNNs combined with structural features dominate simple heuristics.
§ 04 · Applications

Where GNNs actually ship.

Drug discovery
Molecules are graphs. Atoms are nodes, bonds are edges. GNNs predict molecular properties (toxicity, binding affinity, solubility) directly from structure. GIN for 2D topology; SchNet and DimeNet++ for 3D geometry; GPS for long-range interactions. Used by Recursion, Insilico Medicine, Relay Therapeutics.
Social networks
User-user and user-item interactions form massive graphs. GNNs power recommendation, community detection, influence prediction. PinSage (Pinterest), GraphSAGE at scale. Used by Pinterest, Twitter/X, LinkedIn, Snap.
Fraud detection
Fraudsters form rings — accounts connected by shared devices, IPs, payment methods. Individual transactions look normal; the network structure is anomalous. Heterogeneous GNNs (R-GCN), temporal GNNs. Used by PayPal, Stripe, Amazon.
Recommendations
User-item bipartite graphs capture collaborative filtering signals. GNNs propagate preferences through the graph. LightGCN, PinSage, NGCF. Handles cold-start through graph connectivity. Used by Pinterest, Uber Eats, Kuaishou.
§ 05 · Code

Four examples, PyTorch Geometric.

Install: pip install torch-geometric

gcn_cora.py — node classification with GCN
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from torch_geometric.datasets import Planetoid

# Load Cora citation network
dataset = Planetoid(root='data/', name='Cora')
data = dataset[0]

class GCN(torch.nn.Module):
    def __init__(self, in_channels, hidden, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden)
        self.conv2 = GCNConv(hidden, out_channels)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)

model = GCN(dataset.num_features, 64, dataset.num_classes)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

# Training loop
for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

# Evaluate
model.eval()
pred = model(data.x, data.edge_index).argmax(dim=1)
correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
acc = int(correct) / int(data.test_mask.sum())
print(f"Test accuracy: {acc:.4f}")  # ~81.5%
gat.py — Graph Attention Network
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class GAT(torch.nn.Module):
    """Graph Attention Network - learns WHICH neighbors matter."""
    def __init__(self, in_channels, hidden, out_channels, heads=8):
        super().__init__()
        self.conv1 = GATConv(in_channels, hidden, heads=heads, dropout=0.6)
        self.conv2 = GATConv(hidden * heads, out_channels, heads=1,
                             concat=False, dropout=0.6)

    def forward(self, x, edge_index):
        x = F.dropout(x, p=0.6, training=self.training)
        x = F.elu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.6, training=self.training)
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)

# GAT typically hits ~83.0% on Cora (vs ~81.5% for GCN)
# The attention weights are interpretable - you can visualize them
graphsage_minibatch.py — scalable inductive learning
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv
from torch_geometric.loader import NeighborLoader

class GraphSAGE(torch.nn.Module):
    """Inductive learning - works on unseen nodes/graphs."""
    def __init__(self, in_channels, hidden, out_channels):
        super().__init__()
        self.conv1 = SAGEConv(in_channels, hidden)
        self.conv2 = SAGEConv(hidden, out_channels)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return x

# Mini-batch training for large graphs (millions of nodes)
loader = NeighborLoader(
    data,
    num_neighbors=[25, 10],  # Sample 25 1-hop, 10 2-hop neighbors
    batch_size=1024,
    input_nodes=data.train_mask,
)

# Assumes `model = GraphSAGE(...)` and an Adam optimizer set up as in the GCN example
for batch in loader:
    optimizer.zero_grad()
    out = model(batch.x, batch.edge_index)
    # Only the first `batch_size` nodes are seed nodes; the rest are sampled neighbors
    loss = F.cross_entropy(out[:batch.batch_size], batch.y[:batch.batch_size])
    loss.backward()
    optimizer.step()
gin_molecules.py — graph-level classification with GIN
import torch
import torch.nn.functional as F
from torch_geometric.nn import GINConv, global_add_pool
from ogb.graphproppred import PygGraphPropPredDataset  # pip install ogb

class GIN(torch.nn.Module):
    """Graph Isomorphism Network - maximally expressive message passing."""
    def __init__(self, in_channels, hidden, out_channels, num_layers=5):
        super().__init__()
        self.convs = torch.nn.ModuleList()
        self.bns = torch.nn.ModuleList()

        for i in range(num_layers):
            dim_in = in_channels if i == 0 else hidden
            mlp = torch.nn.Sequential(
                torch.nn.Linear(dim_in, hidden),
                torch.nn.ReLU(),
                torch.nn.Linear(hidden, hidden),
            )
            self.convs.append(GINConv(mlp))
            self.bns.append(torch.nn.BatchNorm1d(hidden))

        self.classifier = torch.nn.Linear(hidden, out_channels)

    def forward(self, x, edge_index, batch):
        for conv, bn in zip(self.convs, self.bns):
            x = F.relu(bn(conv(x, edge_index)))
        # Global pooling: graph-level readout
        x = global_add_pool(x, batch)
        return self.classifier(x)

# Molecular property prediction on the OGB dataset from the benchmark section
dataset = PygGraphPropPredDataset(name='ogbg-molpcba', root='data/')
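
The `batch` argument in GIN.forward comes from PyG's graph-level DataLoader, which stacks a mini-batch of molecules into one disconnected graph. A minimal sketch (batch size is illustrative; molpcba's integer atom features would still need an embedding step, e.g. OGB's AtomEncoder, before the Linear layers above):

from torch_geometric.loader import DataLoader

loader = DataLoader(dataset, batch_size=128, shuffle=True)
batch = next(iter(loader))
# batch.x stacks all node features across the mini-batch; batch.batch maps each
# node to its graph index, which is exactly what global_add_pool reads to
# produce one embedding per molecule.
print(batch.num_graphs, batch.x.shape, batch.batch.shape)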
§ 06 · Contrast

GNNs vs Transformers, nuanced.

The “Transformers will replace GNNs” narrative is simplistic. The most powerful architectures combine both.

Dimension           | Message-passing GNNs                | Graph Transformers
Complexity          | O(|E|), linear in edges             | O(|V|²), quadratic in nodes
Receptive field     | K-hop (K = layers)                  | Global (all nodes, one layer)
Structure awareness | Built-in (edges define compute)     | Needs positional encodings
Scalability         | Millions of nodes with sampling     | ~10K nodes without sparse attn
Over-squashing      | Major issue for deep GNNs           | Avoided (global attention)
Best for            | Large sparse graphs, local patterns | Small/medium graphs, long-range
Pragmatic take (2026)
  1. Small molecular graphs (<100 nodes): GPS or Graphormer wins. Global attention cost is negligible.
  2. Large social/citation graphs (100K+): GraphSAGE or GCN with sampling. Full attention is computationally infeasible at this scale.
  3. Medium graphs, varying structure: GAT gives the best interpretability/performance tradeoff.
  4. Graph-level classification: GIN as strong MPNN baseline, GPS for SOTA.
§ 07 · Concepts

What you'll actually need to know.

Message passing
Every GNN follows the same pattern: collect messages from neighbors, aggregate (sum, mean, max, attention), update the node's embedding. One round = one layer = one hop. Stack K layers to see K hops. The aggregation function determines expressiveness.
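
The collect/aggregate/update pattern maps directly onto PyG's MessagePassing base class; a minimal sketch with mean aggregation (the class name and linear update are illustrative):

import torch
from torch_geometric.nn import MessagePassing

class MeanConv(MessagePassing):
    """Collect neighbor features, mean-aggregate them, apply a linear update."""
    def __init__(self, in_channels, out_channels):
        super().__init__(aggr='mean')           # the aggregate step
        self.lin = torch.nn.Linear(in_channels, out_channels)

    def forward(self, x, edge_index):
        # propagate() runs message -> aggregate -> update over all edges
        return self.propagate(edge_index, x=x)

    def message(self, x_j):
        return x_j                               # message = the neighbor's features

    def update(self, aggr_out):
        return self.lin(aggr_out)                # update the node embedding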
Over-smoothing
Stack too many layers and all node embeddings converge to the same vector. Each layer averages over neighborhoods, so after ~6 layers every node has “seen” most of the graph. Practical fix: 2–3 layers. Deeper: add skip connections, normalisation, or use GPS.
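
A sketch of the skip-connection fix (depth is illustrative, and the input is assumed to already have `channels` dimensions):

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class ResidualGCN(torch.nn.Module):
    def __init__(self, channels, num_layers=8):
        super().__init__()
        self.convs = torch.nn.ModuleList(
            GCNConv(channels, channels) for _ in range(num_layers)
        )

    def forward(self, x, edge_index):
        for conv in self.convs:
            # Each layer adds a correction to x rather than replacing it,
            # which slows the collapse toward identical embeddings.
            x = x + F.relu(conv(x, edge_index))
        return x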
Over-squashing
Related but different. Information from distant nodes must pass through bottleneck nodes, causing exponential compression. This is why GNNs struggle with long-range dependencies on tree-like graphs. Solutions: virtual nodes, graph rewiring, Graph Transformers.
Positional & structural encodings
Unlike sequences, graphs have no canonical node ordering. Positional encodings (Laplacian eigenvectors, random walk probabilities) give nodes a sense of “where.” Structural encodings (degree, centrality, subgraph counts) capture “what role.” Both are critical for Graph Transformers and increasingly used with MPNNs.
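
PyG ships transforms for both kinds of encoding; a minimal sketch, assuming a recent version where AddLaplacianEigenvectorPE and AddRandomWalkPE are available (k and walk_length are illustrative):

import torch
import torch_geometric.transforms as T
from torch_geometric.datasets import Planetoid

transform = T.Compose([
    T.AddLaplacianEigenvectorPE(k=8, attr_name='lap_pe'),    # positional: "where am I?"
    T.AddRandomWalkPE(walk_length=16, attr_name='rw_pe'),    # structural: "what role do I play?"
])
data = transform(Planetoid(root='data/', name='Cora')[0])
# Concatenate the encodings onto the node features before the first layer
x = torch.cat([data.x, data.lap_pe, data.rw_pe], dim=-1)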