
TL;DR
- task : graph representation
- problem : Graph Attention Network (GAT) models use attention to learn graph representations, but they generalize poorly because real graphs are very large and noisy.
- IDEA : use a single attention score per edge, shared across all layers, that picks out the important edges and leaves all the others at 0.
- architecture : GAT, but with a single sparse attention score per edge.
- objective : cross-entropy loss plus a loss on the number of edges
- baseline : GCN, GAT, GraphSAGE
- data : Cora, Citeseer, Pubmed, Amazon Computers, Amazon Photo, PPI, Reddit
- result : performance similar to GAT, and more interpretable because a single attention score is selected per edge.
- contribution : shows that GNNs that sparsify the graph first perform as well as GNNs that do not.
- Limitations or things I don’t understand : the model is transductive, but I’m not sure I understand why. How is $z_{ij}$ learned?
Details
GNN in general
A graph $G = (V, E)$ consists of nodes $V$ and edges $E$.
Nodes are represented by a feature matrix $X$ with dimensions $N$ (number of nodes) × $D$ (feature dimension).
The adjacency matrix $A$ is 1 where two nodes are connected by an edge and 0 otherwise; its dimensions are $N$ × $N$. What we ultimately want is to take the features $X$ and the adjacency $A$ and produce an enriched node representation $H$. This function is usually defined as follows:
$$H = f(X, A)$$
Let $f$ be a function that encodes the graph, trained by minimizing a loss,
where the loss is the cross-entropy loss if you’re doing something like a node classification task.
In the end, the question is: how do the different variants of GNN construct $f$?
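As a concrete toy example of these objects (all shapes, values, and the particular averaging encoder here are hypothetical, just to make the notation tangible):

```python
import numpy as np

rng = np.random.default_rng(0)

N, D, D_out = 5, 8, 4          # nodes, input feature dim, output feature dim
X = rng.normal(size=(N, D))    # node feature matrix, N x D
A = np.array([[0, 1, 1, 0, 0], # adjacency matrix, N x N (1 = edge)
              [1, 0, 0, 1, 0],
              [1, 0, 0, 0, 1],
              [0, 1, 0, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)

# A hypothetical one-layer encoder f: average each node with its
# neighbors, then apply a linear map, giving H = f(X, A).
W = rng.normal(size=(D, D_out))
A_hat = A + np.eye(N)                       # add self-loops
A_hat /= A_hat.sum(axis=1, keepdims=True)   # row-normalize
H = A_hat @ X @ W                           # enriched representation, N x D_out

print(H.shape)  # (5, 4)
```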
Neighbor Aggregation Methods
One of the most effective approaches to graph learning is the neighbor aggregation mechanism, which aggregates each node's feature vector $x_i$ with the feature vectors of its neighbors $j$.
For example, the Graph Convolutional Network (GCN) is one of them, and its layer can be written like this:
$$h_i^{(l+1)} = \sigma\Big(\sum_{j \in \mathcal{N}(i) \cup \{i\}} \tfrac{1}{\sqrt{d_i d_j}}\, W^{(l)} h_j^{(l)}\Big)$$
A more generalized version of this would look like this, with $a_{ij}$ an aggregation weight for edge $(i, j)$:
$$h_i^{(l+1)} = \sigma\Big(\sum_{j \in \mathcal{N}(i) \cup \{i\}} a_{ij}\, W^{(l)} h_j^{(l)}\Big)$$
However, GCN can only be used transductively, meaning it must be retrained when the graph structure changes.
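A minimal numpy sketch of one GCN layer with the usual symmetric normalization (a toy version with made-up shapes, not the paper's code):

```python
import numpy as np

def gcn_layer(X, A, W):
    """One GCN layer: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} X W)."""
    N = A.shape[0]
    A_hat = A + np.eye(N)                   # add self-loops
    d = A_hat.sum(axis=1)                   # node degrees
    D_inv_sqrt = np.diag(d ** -0.5)
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt
    return np.maximum(A_norm @ X @ W, 0.0)  # ReLU nonlinearity

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = rng.normal(size=(3, 2))
H = gcn_layer(X, A, W)
print(H.shape)  # (4, 2)
```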
Graph Attention Networks (GAT)
Like GCN, GAT aggregates over neighborhoods, but it uses attention to compute a score that determines which edges to focus on:
$$h_i^{(l+1)} = \sigma\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l)}\, W^{(l)} h_j^{(l)}\Big), \qquad \alpha_{ij}^{(l)} = \operatorname{softmax}_j\!\Big(\mathrm{LeakyReLU}\big(a^\top [\, W^{(l)} h_i^{(l)} \,\|\, W^{(l)} h_j^{(l)} \,]\big)\Big)$$
In this view, the attention score can be read as the importance of each edge, but it is hard to interpret because the score is different for every layer (and head). The authors therefore propose SGAT, which removes noisy/task-irrelevant edges from the graph.
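The GAT attention above can be sketched in plain numpy (a toy single-head version with hypothetical shapes; real implementations use multi-head attention and sparse operations):

```python
import numpy as np

def gat_attention(X, A, W, a):
    """Attention alpha_ij = softmax over neighbors j of
    LeakyReLU(a^T [W x_i || W x_j]), masked by the adjacency matrix."""
    H = X @ W                               # projected features, N x D'
    N = H.shape[0]
    e = np.zeros((N, N))                    # raw scores for every pair
    for i in range(N):
        for j in range(N):
            e[i, j] = np.concatenate([H[i], H[j]]) @ a
    e = np.where(e > 0, e, 0.2 * e)         # LeakyReLU with slope 0.2
    e = np.where(A > 0, e, -1e9)            # keep only real edges
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)  # row-wise softmax
    return alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
W = rng.normal(size=(3, 2))
a = rng.normal(size=(4,))   # attention vector acting on [Wh_i || Wh_j]
alpha = gat_attention(X, A, W, a)
print(alpha.sum(axis=1))    # each row sums to 1
```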
SparseGAT(SGAT)
To keep only the important edges, we attach a binary gate $z_{ij}$ to each edge. The gate masks whether edge $e_{ij}$ is used or not:
$$\tilde{A} = A \odot Z, \qquad z_{ij} \in \{0, 1\}$$
An L0 penalty is added to the loss term so that as few edges as possible are kept. Since each $z_{ij}$ is either 0 or 1, summing the gates simply counts the retained edges (a loss over the number of edges):
$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda \sum_{(i,j) \in E} z_{ij}$$
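A toy sketch of the gating idea (the gates here are random placeholders just to show the masking and the edge count; in SGAT they are learned, and $\lambda$ is a hypothetical regularization weight):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)

# Placeholder binary gates z_ij, one per possible edge, shared across layers.
Z = rng.integers(0, 2, size=(N, N)).astype(float)
A_masked = A * Z                  # edge e_ij survives only if z_ij = 1

# L0 penalty: the number of edges kept after masking.
l0 = np.count_nonzero(A_masked)
lam = 1e-3                        # hypothetical regularization weight
# The total loss would be: task cross-entropy + lam * l0.
print(l0, np.count_nonzero(A))
```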
The attention-based aggregation function can then be written as follows, which is no different from GAT except that the gates mask the edges:
$$h_i^{(l+1)} = \sigma\Big(\sum_{j \in \mathcal{N}(i)} z_{ij}\, \alpha_{ij}^{(l)}\, W^{(l)} h_j^{(l)}\Big)$$
We get the attention score in the standard GAT way:
$$\alpha_{ij}^{(l)} = \operatorname{softmax}_j\!\Big(\mathrm{LeakyReLU}\big(a^\top [\, W^{(l)} h_i^{(l)} \,\|\, W^{(l)} h_j^{(l)} \,]\big)\Big)$$
-> How is $z_{ij}$ learned?
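A common way to learn binary gates under an L0 penalty (and, I believe, what models in this family build on) is the hard-concrete reparameterization of Louizos et al., which makes near-binary gates differentiable. A hypothetical numpy sketch, with the usual hyperparameters:

```python
import numpy as np

def hard_concrete_sample(log_alpha, rng, beta=2/3, gamma=-0.1, zeta=1.1):
    """Sample a differentiable surrogate gate z in [0, 1].

    log_alpha is the learnable parameter per edge; stretching the
    sigmoid to (gamma, zeta) and clipping puts mass exactly at 0 and 1.
    """
    u = rng.uniform(1e-6, 1 - 1e-6, size=np.shape(log_alpha))
    s = 1 / (1 + np.exp(-(np.log(u) - np.log(1 - u) + log_alpha) / beta))
    s = s * (zeta - gamma) + gamma   # stretch to (gamma, zeta)
    return np.clip(s, 0.0, 1.0)     # clip back to [0, 1]

def expected_l0(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    """Differentiable expected number of nonzero gates (the L0 penalty)."""
    return 1 / (1 + np.exp(-(log_alpha - beta * np.log(-gamma / zeta))))

rng = np.random.default_rng(0)
log_alpha = np.array([-4.0, 0.0, 4.0])  # one learnable parameter per edge
z = hard_concrete_sample(log_alpha, rng)
print(z)
print(expected_l0(log_alpha))  # penalty grows monotonically with log_alpha
```

At test time the gates are typically thresholded to exact 0/1 values, which is what makes the final graph genuinely sparse.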
The motivation for this sparse configuration: the authors measured the variance of the attention scores per layer and per head, and it was almost zero, as shown below.
