
TL;DR
- I read this because.. : NeurIPS2023
- task : image classification, object detection
- problem : a CNN slides a window over the image and a ViT cuts the image into patches and feeds them in as a sequence, but the authors want something more flexible.
- idea : cut the image into patches, view each patch as a node, and use a GNN
- architecture : multi-head max-relative GCN + linear + BN + ReLU + linear + BN + FFN, stacked in multiple layers.
- baseline : ResNet, CycleMLP, Swin-T
- data : ImageNet ILSVRC 2012, COCO2017
- RESULT : SOTA at similar FLOPs compared to other tiny-scale models.
- contribution : first GNN backbone for image representation
- limitation/things I cannot understand : I'm not sure what the advantage over ViT is ;; either way you cut the image into patches and feed them in, and self-attention already gives full connectivity.. hmm.
Details
motivation

Preliminaries
GNN in general https://github.com/long8v/PTIR/issues/55
over-smoothing problem: node embeddings become more and more similar as layers get deeper

https://ydy8989.github.io/2021-03-03-GAT/
Architecture
Graph Structure of Image
Divide the $H \times W \times 3$ image into $N$ patches. Represent each patch as a feature vector $\mathrm{x}_i \in \mathbb{R}^D$ to get $X=[\mathrm{x}_1, \mathrm{x}_2, \dots, \mathrm{x}_N]$. These features can be viewed as an unordered set of nodes $\mathcal{V}=\{v_1, v_2, \dots, v_N\}$. For each node $v_i$, find its $K$ nearest neighbors $\mathcal{N}(v_i)$ and add an edge $e_{ij}$ for each neighbor $v_j$. This gives the graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$, which we can pass through a GNN! We write the graph construction as $\mathcal{G}=G(X)$.
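The graph construction $G(X)$ can be sketched in NumPy as a simple KNN lookup over patch features (a minimal sketch; the function name and shapes are my own, and the real model computes neighbors in feature space at every layer):

```python
import numpy as np

def build_knn_graph(X, k):
    """Return the k nearest neighbors of each patch feature (Euclidean)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # (N, N) squared distances
    np.fill_diagonal(d2, np.inf)                         # exclude self-loops
    return np.argsort(d2, axis=1)[:, :k]                 # (N, k) neighbor indices

# toy example: 8 "patches" with 4-dim features, K=3 neighbors each
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))
nbrs = build_knn_graph(X, k=3)
print(nbrs.shape)  # (8, 3)
```

The returned index array plays the role of the edge set $\mathcal{E}$: row $i$ lists the nodes connected to $v_i$.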
- Advantages of representing with Graph
- A graph is a very general representation of structure! The grid in a CNN or the sequence in a ViT can be seen as special cases of a graph.
- Graphs may have advantages over grids and sequences for representing complex objects with variable shapes.
- An object can be viewed as a combination of parts (head, torso, arms, legs for a person), so a graph may be better at modeling how those parts combine.
- Leverage the latest GNN architecture
Graph-level processing

Starting from features $X \in \mathbb{R}^{N\times D}$, construct the graph $\mathcal{G}=G(X)$.
A graph convolutional layer exchanges information between neighboring nodes while aggregating their features.

More concretely, for each node $x_i$, aggregate its neighbors' information to produce $x_i'$.
We will use max-relative graph convolution

When aggregating, take the element-wise max of the differences between each neighbor feature and the node feature.
A multi-head update operation is applied on top, introduced for feature diversity.
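The multi-head max-relative convolution might be sketched as follows (NumPy; the per-head weight list `Ws` and all shapes are my own illustration, not the paper's exact parameterization):

```python
import numpy as np

def multihead_max_relative(X, nbrs, Ws):
    """Sketch of multi-head max-relative graph convolution.

    X:  (N, D) node features; nbrs: (N, k) neighbor indices.
    Ws: list of h per-head weight matrices, each (2D/h, D/h).
    Aggregate max_j (x_j - x_i), concat with x_i, split into heads,
    update each head with its own weights, then concatenate."""
    agg = (X[nbrs] - X[:, None, :]).max(axis=1)       # (N, D) max-relative
    feat = np.concatenate([X, agg], axis=-1)          # (N, 2D)
    heads = np.split(feat, len(Ws), axis=-1)          # h chunks of (N, 2D/h)
    return np.concatenate([f @ W for f, W in zip(heads, Ws)], axis=-1)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))
nbrs = rng.integers(0, 8, size=(8, 3))
Ws = [rng.standard_normal((8, 4)) for _ in range(4)]  # 4 heads: 2D/h=8 -> D/h=4
out = multihead_max_relative(X, nbrs, Ws)
print(out.shape)  # (8, 16)
```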

ViG block
With the plain graph convolution above, the differences between node features are gradually lost as graph convolution layers are stacked, resulting in performance degradation.

So the ViG block wants to add more feature transformation and nonlinear activation.
Put a linear layer before and after the GCN layer, add a nonlinear activation, and add an FFN on top.

This resulted in better feature diversity than ResGCN (Figure 3 above).
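Putting the pieces together, a minimal ViG block sketch (NumPy; BatchNorm omitted, ReLU in place of the paper's activation, all weight names and shapes hypothetical):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def vig_block(X, nbrs, Win, Wout, W1, W2):
    """Grapher (linear -> max-relative GC -> linear) + FFN, both residual."""
    # Grapher module
    h = X @ Win                                    # pre-GC linear transform
    agg = (h[nbrs] - h[:, None, :]).max(axis=1)    # max-relative aggregation
    h = np.concatenate([h, agg], axis=-1)          # (N, 2D)
    h = relu(h @ Wout) + X                         # post-GC linear + residual
    # FFN module with residual
    return relu(h @ W1) @ W2 + h

rng = np.random.default_rng(0)
N, D = 8, 16
X = rng.standard_normal((N, D))
nbrs = rng.integers(0, N, size=(N, 3))
out = vig_block(X, nbrs,
                Win=rng.standard_normal((D, D)),
                Wout=rng.standard_normal((2 * D, D)),
                W1=rng.standard_normal((D, 4 * D)),
                W2=rng.standard_normal((4 * D, D)))
print(out.shape)  # (8, 16)
```

The two residual connections are what keep node features from collapsing into one another as blocks are stacked.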
Network Architecture
Isotropic architecture
Models like ViT or ResMLP, where the features keep the same size throughout the network.

Pyramid architecture
Feature maps shrink progressively, as in ResNet and PVT.

PE
absolute PE is added to the node features
Result
Experiment detail

Result for isotropic

Result for Pyramid

Object Detection result

visualization

etc
- max relative graph convolution
Proposed in DeepGCNs: Can GCNs Go as Deep as CNNs? (https://arxiv.org/pdf/1904.03751.pdf). Because of the over-smoothing problem above, most GCN models were 4 layers or fewer; similar in spirit to ResNet, this paper asks how to make GCNs deeper.
residual / dense connection

dilation
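The dilation idea from DeepGCNs might be sketched as: take the $k \cdot d$ nearest neighbors, then keep every $d$-th one (a sketch; function name and shapes are my own):

```python
import numpy as np

def dilated_knn(X, k, d):
    """Dilated KNN per DeepGCNs: from the k*d nearest neighbors,
    keep every d-th one, enlarging the receptive field without
    adding more edges per node."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    order = np.argsort(d2, axis=1)[:, : k * d]  # k*d closest, nearest first
    return order[:, ::d]                        # (N, k) dilated neighbors

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 4))
nbrs = dilated_knn(X, k=3, d=2)
print(nbrs.shape)  # (10, 3)
```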
