Published March 2026
Guide · Graph ML

Graph neural networks: when and why.

Not everything is a graph problem, and not every graph problem needs a GNN. This guide helps you decide when graph structure genuinely helps — and which architecture to reach for when it does.

Five architectures, three OGB benchmarks, four code examples. PyTorch Geometric throughout.

§ 01 · Fit

When you need a GNN, and when you don't.

Two checklists, then a four-question flowchart that usually resolves it.

GNNs shine when…
  • Structure is informative: a molecule's 3D connectivity determines its properties.
  • Topology varies: different samples have different graph structures.
  • Relational reasoning matters: predicting interactions, influence, links.
  • Inductive generalisation: unseen graphs at inference time.
  • Sparse connections: each node connects to few neighbors relative to graph size.
Skip GNNs when…
  • Data is tabular: XGBoost still beats GNNs on feature-rich tabular data.
  • Graph is fully connected: you have reinvented attention — use a Transformer.
  • Edges are artificial: constructing a KNN graph rarely beats an MLP.
  • Sequential structure dominates: sequence models capture the inductive bias better.
  • Scale is tiny: under 100 nodes, classical methods often suffice.
The decision flowchart
1. Is your data naturally a graph? No → tabular/sequence first. Yes → continue.
2. Does topology vary across samples? No, single fixed graph → spectral or label propagation. Yes → GNN.
3. Generalise to unseen graphs? No → GCN or spectral. Yes → GraphSAGE, GIN, or GPS.
4. Long-range dependencies critical? No → GCN/GAT (2–3 layers). Yes → GPS or deep GNN with skips.
§ 02 · Architectures

Five architectures, simply explained.

GCN · Graph Convolutional Network
Kipf & Welling, 2017

One-liner: Each node averages its neighbors' features (weighted by degree), then applies a linear transform.

Intuition: Think of it as a CNN where the “kernel” slides over a node's neighborhood instead of a pixel grid. Symmetric normalisation prevents high-degree nodes from dominating.

+ Simple, fast, well-understood
+ Strong baseline for node classification
− All neighbors weighted equally
− Transductive (fixed graph at train time)
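
In symbols, with \tilde{A} = A + I (adjacency plus self-loops) and \tilde{D} its degree matrix, one GCN layer is:

H^{(l+1)} = \sigma\left( \tilde{D}^{-1/2} \, \tilde{A} \, \tilde{D}^{-1/2} \, H^{(l)} W^{(l)} \right)

The \tilde{D}^{-1/2} factors are the symmetric normalisation mentioned above: each message is divided by the square-rooted degrees of both endpoints.
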
GAT · Graph Attention Network
Veličković et al., 2018

One-liner: Like GCN, but learns attention weights for each edge — some neighbors matter more than others.

Intuition: In a citation network, not all papers that cite yours are equally relevant. GAT learns a small neural network that scores each edge, softmax-normalised. Multi-head attention stabilises training.

+ Adaptive, interpretable attention
+ Works on heterogeneous graphs
− More memory than GCN
− Original has expressiveness issues (fixed by GATv2)
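
The learned edge score and its normalisation, for a single attention head:

e_{ij} = \mathrm{LeakyReLU}\left( \mathbf{a}^{\top} [\, W h_i \,\|\, W h_j \,] \right), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})}, \qquad
h_i' = \sigma\Big( \sum_{j \in \mathcal{N}(i)} \alpha_{ij} \, W h_j \Big)

where \| is concatenation and \mathbf{a} is the small scoring network mentioned in the intuition.
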
GraphSAGE · Sample and Aggregate
Hamilton et al., 2017

One-liner: Samples a fixed-size neighborhood, aggregates features, concatenates with the node's own embedding.

Intuition: The key insight is sampling. Instead of using the full neighborhood (impossible for billion-node graphs), sample K neighbors at each layer. This enables mini-batch training and inductive learning — the model generalises to unseen nodes and graphs.

+ Scales to millions of nodes
+ Inductive (unseen graphs)
+ Mini-batch training
− Sampling introduces variance
− Aggregator choice matters
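
One SAGE layer in symbols, shown with the mean aggregator and a sampled neighborhood \mathcal{N}_S(v):

h_v' = \sigma\left( W \cdot \big[\, h_v \,\|\, \mathrm{mean}\{ h_u : u \in \mathcal{N}_S(v) \} \,\big] \right)

Concatenating h_v with the aggregate (rather than mixing them) keeps the node's own signal separate from its neighborhood's.
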
GIN · Graph Isomorphism Network
Xu et al., 2019

One-liner: The most expressive message-passing GNN — provably as powerful as the 1-WL graph isomorphism test.

Intuition: GCN and GraphSAGE use mean/max aggregation, which cannot distinguish certain graph structures. GIN uses sum aggregation with an MLP, preserving multiset information. Different local structures map to different embeddings — the go-to choice for graph classification.

+ Maximally expressive within MPNN
+ Best for graph-level tasks
− Limited by 1-WL expressiveness ceiling
− Sum aggregation can be unstable on large neighborhoods
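
The GIN update in symbols:

h_v^{(k)} = \mathrm{MLP}^{(k)}\Big( (1 + \epsilon^{(k)}) \, h_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)} \Big)

The sum (rather than mean or max) preserves the multiset of neighbor features, which is where the 1-WL expressiveness comes from.
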
GPS · General, Powerful, Scalable Graph Transformer
Rampášek et al., 2022

One-liner: Combines local message passing (MPNN) with global self-attention in each layer, plus positional/structural encodings.

Intuition: Standard GNNs suffer from over-squashing: information from distant nodes gets bottlenecked. GPS fixes this by adding a Transformer-style global attention path alongside the local GNN path. Positional encodings (Laplacian eigenvectors, random walk statistics) give the model topology awareness.

+ Long-range dependencies
+ SOTA on many OGB benchmarks
+ Flexible — any MPNN + any attention
− O(n²) attention for global component
− More hyperparameters to tune
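
A minimal sketch of a GPS-style stack using PyTorch Geometric's GPSConv (available in recent PyG releases). The hidden size, depth, and the choice of GIN as the local message-passing component are illustrative, not the reference configuration; positional encodings would be concatenated onto the input features as discussed in § 07.

import torch
from torch_geometric.nn import GPSConv, GINConv, global_add_pool

class GPSModel(torch.nn.Module):
    def __init__(self, in_channels, hidden, out_channels, num_layers=4, heads=4):
        super().__init__()
        self.lin_in = torch.nn.Linear(in_channels, hidden)
        self.layers = torch.nn.ModuleList()
        for _ in range(num_layers):
            mlp = torch.nn.Sequential(
                torch.nn.Linear(hidden, hidden),
                torch.nn.ReLU(),
                torch.nn.Linear(hidden, hidden),
            )
            # Each GPS layer runs local message passing (GIN) and global
            # self-attention in parallel, then combines the two paths.
            self.layers.append(GPSConv(hidden, GINConv(mlp), heads=heads))
        self.classifier = torch.nn.Linear(hidden, out_channels)

    def forward(self, x, edge_index, batch):
        x = self.lin_in(x)
        for layer in self.layers:
            x = layer(x, edge_index, batch)
        return self.classifier(global_add_pool(x, batch))
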
§ 03 · Benchmarks

Open Graph Benchmark, 2026.

Numbers from the OGB leaderboards. Realistic, reproducible — not cherry-picked.

ogbg-molpcba

Graph classification: 128 bioassay labels, ~438K molecules. Metric: AP.

Model               | Type              | Test AP | Params
GPS + virtual node  | Graph Transformer | 0.3212  | ~6.2M
GIN + virtual node  | MPNN              | 0.2921  | ~3.4M
GCN + virtual node  | MPNN              | 0.2724  | ~2.0M
GIN (no VN)         | MPNN              | 0.2703  | ~1.9M
GCN (no VN)         | MPNN              | 0.2424  | ~1.5M
Virtual nodes add a global node connected to all others — a simple way to enable long-range message passing without full attention.
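
PyTorch Geometric exposes this as a data transform; a minimal sketch, assuming `data` is a standard Data object like the ones in the code section below:

import torch_geometric.transforms as T

# VirtualNode appends one extra node per graph and connects it to every other
# node in both directions, so any two nodes are at most two hops apart.
data_with_vn = T.VirtualNode()(data)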

ogbn-arxiv

Node classification: subject area of ~170K arXiv CS papers. Metric: Accuracy.

Model                   | Type              | Test acc. | Layers
GIANT-XRT + RevGAT      | GAT + LM features | 77.36%    | 3
GAT + label propagation | GAT + post-proc   | 74.15%    | 3
GCN                     | MPNN              | 71.74%    | 3
GraphSAGE               | MPNN (inductive)  | 71.49%    | 3
MLP (no graph)          | Baseline          | 55.50%    | 3
The MLP baseline shows how much graph structure matters here: +16 points from adding edges.

ogbl-collab

Link prediction: future collaborations between ~235K authors. Metric: Hits@50.

Model                     | Type           | Hits@50
S2GAE + SEAL              | GNN + subgraph | 66.79%
BUDDY                     | GNN + hashing  | 65.94%
GraphSAGE + edge features | MPNN           | 54.63%
Common Neighbors          | Heuristic      | 44.75%
Link prediction is where GNNs combined with structural features dominate simple heuristics.
§ 04 · Applications

Where GNNs actually ship.

Drug discovery
Molecules are graphs. Atoms are nodes, bonds are edges. GNNs predict molecular properties (toxicity, binding affinity, solubility) directly from structure. GIN for 2D topology; SchNet and DimeNet++ for 3D geometry; GPS for long-range interactions. Used by Recursion, Insilico Medicine, Relay Therapeutics.
Social networks
User-user and user-item interactions form massive graphs. GNNs power recommendation, community detection, influence prediction. PinSage (Pinterest), GraphSAGE at scale. Used by Pinterest, Twitter/X, LinkedIn, Snap.
Fraud detection
Fraudsters form rings — accounts connected by shared devices, IPs, payment methods. Individual transactions look normal; the network structure is anomalous. Heterogeneous GNNs (R-GCN), temporal GNNs. Used by PayPal, Stripe, Amazon.
Recommendations
User-item bipartite graphs capture collaborative filtering signals. GNNs propagate preferences through the graph. LightGCN, PinSage, NGCF. Handles cold-start through graph connectivity. Used by Pinterest, Uber Eats, Kuaishou.
§ 05 · Code

Four examples, PyTorch Geometric.

Install: pip install torch-geometric

gcn_cora.py — node classification with GCN
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from torch_geometric.datasets import Planetoid

# Load Cora citation network
dataset = Planetoid(root='data/', name='Cora')
data = dataset[0]

class GCN(torch.nn.Module):
    def __init__(self, in_channels, hidden, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden)
        self.conv2 = GCNConv(hidden, out_channels)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)

model = GCN(dataset.num_features, 64, dataset.num_classes)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

# Training loop
for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

# Evaluate
model.eval()
pred = model(data.x, data.edge_index).argmax(dim=1)
correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
acc = int(correct) / int(data.test_mask.sum())
print(f"Test accuracy: {acc:.4f}")  # ~81.5%
gat.py — Graph Attention Network
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class GAT(torch.nn.Module):
    """Graph Attention Network - learns WHICH neighbors matter."""
    def __init__(self, in_channels, hidden, out_channels, heads=8):
        super().__init__()
        self.conv1 = GATConv(in_channels, hidden, heads=heads, dropout=0.6)
        self.conv2 = GATConv(hidden * heads, out_channels, heads=1,
                             concat=False, dropout=0.6)

    def forward(self, x, edge_index):
        x = F.dropout(x, p=0.6, training=self.training)
        x = F.elu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.6, training=self.training)
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)

# GAT typically hits ~83.0% on Cora (vs ~81.5% for GCN)
# The attention weights are interpretable - you can visualize them
graphsage_minibatch.py — scalable inductive learning
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv
from torch_geometric.loader import NeighborLoader

class GraphSAGE(torch.nn.Module):
    """Inductive learning - works on unseen nodes/graphs."""
    def __init__(self, in_channels, hidden, out_channels):
        super().__init__()
        self.conv1 = SAGEConv(in_channels, hidden)
        self.conv2 = SAGEConv(hidden, out_channels)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return x

# Mini-batch training for large graphs (millions of nodes)
loader = NeighborLoader(
    data,
    num_neighbors=[25, 10],  # Sample 25 1-hop, 10 2-hop neighbors
    batch_size=1024,
    input_nodes=data.train_mask,
)

# Assumes `model = GraphSAGE(...)` and an Adam optimizer set up as in the GCN example
for batch in loader:
    optimizer.zero_grad()
    out = model(batch.x, batch.edge_index)
    # Only the first `batch_size` nodes are seed nodes; the rest are sampled neighbors
    loss = F.cross_entropy(out[:batch.batch_size], batch.y[:batch.batch_size])
    loss.backward()
    optimizer.step()
gin_molecules.py — graph-level classification with GIN
import torch
import torch.nn.functional as F
from torch_geometric.nn import GINConv, global_add_pool
from ogb.graphproppred import PygGraphPropPredDataset  # pip install ogb

class GIN(torch.nn.Module):
    """Graph Isomorphism Network - maximally expressive message passing."""
    def __init__(self, in_channels, hidden, out_channels, num_layers=5):
        super().__init__()
        self.convs = torch.nn.ModuleList()
        self.bns = torch.nn.ModuleList()

        for i in range(num_layers):
            dim_in = in_channels if i == 0 else hidden
            mlp = torch.nn.Sequential(
                torch.nn.Linear(dim_in, hidden),
                torch.nn.ReLU(),
                torch.nn.Linear(hidden, hidden),
            )
            self.convs.append(GINConv(mlp))
            self.bns.append(torch.nn.BatchNorm1d(hidden))

        self.classifier = torch.nn.Linear(hidden, out_channels)

    def forward(self, x, edge_index, batch):
        for conv, bn in zip(self.convs, self.bns):
            x = F.relu(bn(conv(x, edge_index)))
        # Global pooling: graph-level readout
        x = global_add_pool(x, batch)
        return self.classifier(x)

# Molecular property prediction on the OGB dataset from the benchmark section
dataset = PygGraphPropPredDataset(name='ogbg-molpcba', root='data/')
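
The `batch` argument in GIN.forward comes from PyG's graph-level DataLoader, which stacks a mini-batch of molecules into one disconnected graph. A minimal sketch (batch size is illustrative; molpcba's integer atom features would still need an embedding step, e.g. OGB's AtomEncoder, before the Linear layers above):

from torch_geometric.loader import DataLoader

loader = DataLoader(dataset, batch_size=128, shuffle=True)
batch = next(iter(loader))
# batch.x stacks all node features across the mini-batch; batch.batch maps each
# node to its graph index, which is exactly what global_add_pool reads to
# produce one embedding per molecule.
print(batch.num_graphs, batch.x.shape, batch.batch.shape)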
§ 06 · Contrast

GNNs vs Transformers, nuanced.

The “Transformers will replace GNNs” narrative is simplistic. The most powerful architectures combine both.

Dimension           | Message-passing GNNs                | Graph Transformers
Complexity          | O(|E|), linear in edges             | O(|V|²), quadratic in nodes
Receptive field     | K-hop (K = layers)                  | Global (all nodes, one layer)
Structure awareness | Built-in (edges define compute)     | Needs positional encodings
Scalability         | Millions of nodes with sampling     | ~10K nodes without sparse attn
Over-squashing      | Major issue for deep GNNs           | Avoided (global attention)
Best for            | Large sparse graphs, local patterns | Small/medium graphs, long-range
Pragmatic take (2026)
  1. Small molecular graphs (<100 nodes): GPS or Graphormer wins. Global attention cost is negligible.
  2. Large social/citation graphs (100K+): GraphSAGE or GCN with sampling. Full attention is computationally infeasible at this scale.
  3. Medium graphs, varying structure: GAT gives the best interpretability/performance tradeoff.
  4. Graph-level classification: GIN as strong MPNN baseline, GPS for SOTA.
§ 07 · Concepts

What you'll actually need to know.

Message passing
Every GNN follows the same pattern: collect messages from neighbors, aggregate (sum, mean, max, attention), update the node's embedding. One round = one layer = one hop. Stack K layers to see K hops. The aggregation function determines expressiveness.
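
The collect/aggregate/update pattern maps directly onto PyG's MessagePassing base class; a minimal sketch with mean aggregation (the class name and linear update are illustrative):

import torch
from torch_geometric.nn import MessagePassing

class MeanConv(MessagePassing):
    """Collect neighbor features, mean-aggregate them, apply a linear update."""
    def __init__(self, in_channels, out_channels):
        super().__init__(aggr='mean')           # the aggregate step
        self.lin = torch.nn.Linear(in_channels, out_channels)

    def forward(self, x, edge_index):
        # propagate() runs message -> aggregate -> update over all edges
        return self.propagate(edge_index, x=x)

    def message(self, x_j):
        return x_j                               # message = the neighbor's features

    def update(self, aggr_out):
        return self.lin(aggr_out)                # update the node embedding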
Over-smoothing
Stack too many layers and all node embeddings converge to the same vector. Each layer averages over neighborhoods, so after ~6 layers every node has “seen” most of the graph. Practical fix: 2–3 layers. Deeper: add skip connections, normalisation, or use GPS.
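
A sketch of the skip-connection fix (depth is illustrative, and the input is assumed to already have `channels` dimensions):

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class ResidualGCN(torch.nn.Module):
    def __init__(self, channels, num_layers=8):
        super().__init__()
        self.convs = torch.nn.ModuleList(
            GCNConv(channels, channels) for _ in range(num_layers)
        )

    def forward(self, x, edge_index):
        for conv in self.convs:
            # Each layer adds a correction to x rather than replacing it,
            # which slows the collapse toward identical embeddings.
            x = x + F.relu(conv(x, edge_index))
        return x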
Over-squashing
Related but different. Information from distant nodes must pass through bottleneck nodes, causing exponential compression. This is why GNNs struggle with long-range dependencies on tree-like graphs. Solutions: virtual nodes, graph rewiring, Graph Transformers.
Positional & structural encodings
Unlike sequences, graphs have no canonical node ordering. Positional encodings (Laplacian eigenvectors, random walk probabilities) give nodes a sense of “where.” Structural encodings (degree, centrality, subgraph counts) capture “what role.” Both are critical for Graph Transformers and increasingly used with MPNNs.
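
PyG ships transforms for both kinds of encoding; a minimal sketch, assuming a recent version where AddLaplacianEigenvectorPE and AddRandomWalkPE are available (k and walk_length are illustrative):

import torch
import torch_geometric.transforms as T
from torch_geometric.datasets import Planetoid

transform = T.Compose([
    T.AddLaplacianEigenvectorPE(k=8, attr_name='lap_pe'),    # positional: "where am I?"
    T.AddRandomWalkPE(walk_length=16, attr_name='rw_pe'),    # structural: "what role do I play?"
])
data = transform(Planetoid(root='data/', name='Cora')[0])
# Concatenate the encodings onto the node features before the first layer
x = torch.cat([data.x, data.lap_pe, data.rw_pe], dim=-1)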