
TL;DR
- I read this because.. : NeurIPS2023
- task : image classification, object detection
- problem : a CNN slides a window over the image and a ViT cuts the image into patches and feeds them in as a sequence, but the authors want something more flexible.
- idea : cut the image into patches, view each patch as a node, and use a GNN
- architecture : multi-head max-relative GCN + linear + BN + ReLU + linear + BN + FFN, stacked in multiple layers.
- baseline : ResNet, CycleMLP, Swin-T
- data : ImageNet ILSVRC 2012, COCO2017
- RESULT : SOTA at similar FLOPs compared to other tiny-scale models.
- contribution : first GNN backbone for image representation
- limitation/things I cannot understand : I'm not sure what the advantage over ViT is ;; either way you cut the image into patches and feed them in, and self-attention already gives full connectivity.. hmm.
Details
motivation

Preliminaries
GNN in general https://github.com/long8v/PTIR/issues/55
over-smoothing problem: node embeddings become more and more similar as layers get deeper

https://ydy8989.github.io/2021-03-03-GAT/
Architecture
Graph Structure of Image
Divide the $H \times W \times 3$ image into $N$ patches. Represent each patch as a feature vector $\mathrm{x}_i \in \mathbb{R}^D$ to get $X=[\mathrm{x}_1, \mathrm{x}_2, \dots, \mathrm{x}_N]$. These features can be viewed as an unordered set of nodes $\mathcal{V}=\{v_1, v_2, \dots, v_N\}$. For each node $v_i$, find its $K$ nearest neighbors $\mathcal{N}(v_i)$ and add an edge $e_{ij}$ for each neighbor $v_j$. This gives the graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$, which we can pass through a GNN! We write the graph construction as $\mathcal{G}=G(X)$.
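The graph construction $G(X)$ can be sketched in NumPy as a simple KNN lookup over patch features (a minimal sketch; the function name and shapes are my own, and the real model computes neighbors in feature space at every layer):

```python
import numpy as np

def build_knn_graph(X, k):
    """Return the k nearest neighbors of each patch feature (Euclidean)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # (N, N) squared distances
    np.fill_diagonal(d2, np.inf)                         # exclude self-loops
    return np.argsort(d2, axis=1)[:, :k]                 # (N, k) neighbor indices

# toy example: 8 "patches" with 4-dim features, K=3 neighbors each
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))
nbrs = build_knn_graph(X, k=3)
print(nbrs.shape)  # (8, 3)
```

The returned index array plays the role of the edge set $\mathcal{E}$: row $i$ lists the nodes connected to $v_i$.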
- Advantages of representing with Graph
- A graph is a very general representation of structure! The grid in a CNN or the sequence in a ViT can be seen as special cases of a graph.
- Graphs may have advantages over grids and sequences for representing complex objects with variable shapes.
- An object can be viewed as a combination of parts (head, torso, arms, legs for a person), so a graph may be better at modeling how those parts combine.
- Leverage the latest GNN architecture
Graph-level processing

Starting from features $X \in \mathbb{R}^{N\times D}$, construct the graph $\mathcal{G}=G(X)$.
A graph convolutional layer exchanges information between neighboring nodes while aggregating their features.

More concretely, for each node $x_i$, aggregate its neighbors' information to produce $x_i'$.
We will use max-relative graph convolution

When aggregating, take the element-wise max of the differences between each neighbor feature and the node feature.
A multi-head update operation is applied on top, introduced for feature diversity.
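The multi-head max-relative convolution might be sketched as follows (NumPy; the per-head weight list `Ws` and all shapes are my own illustration, not the paper's exact parameterization):

```python
import numpy as np

def multihead_max_relative(X, nbrs, Ws):
    """Sketch of multi-head max-relative graph convolution.

    X:  (N, D) node features; nbrs: (N, k) neighbor indices.
    Ws: list of h per-head weight matrices, each (2D/h, D/h).
    Aggregate max_j (x_j - x_i), concat with x_i, split into heads,
    update each head with its own weights, then concatenate."""
    agg = (X[nbrs] - X[:, None, :]).max(axis=1)       # (N, D) max-relative
    feat = np.concatenate([X, agg], axis=-1)          # (N, 2D)
    heads = np.split(feat, len(Ws), axis=-1)          # h chunks of (N, 2D/h)
    return np.concatenate([f @ W for f, W in zip(heads, Ws)], axis=-1)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))
nbrs = rng.integers(0, 8, size=(8, 3))
Ws = [rng.standard_normal((8, 4)) for _ in range(4)]  # 4 heads: 2D/h=8 -> D/h=4
out = multihead_max_relative(X, nbrs, Ws)
print(out.shape)  # (8, 16)
```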

ViG block
With the plain graph convolution above, the differences between node features are gradually lost as graph convolution layers are stacked, resulting in performance degradation.

So the ViG block wants to add more feature transformation and nonlinear activation.
Put a linear layer before and after the GCN layer, add a nonlinear activation, and add an FFN on top.

This resulted in better feature diversity than ResGCN (Figure 3 above).
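Putting the pieces together, a minimal ViG block sketch (NumPy; BatchNorm omitted, ReLU in place of the paper's activation, all weight names and shapes hypothetical):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def vig_block(X, nbrs, Win, Wout, W1, W2):
    """Grapher (linear -> max-relative GC -> linear) + FFN, both residual."""
    # Grapher module
    h = X @ Win                                    # pre-GC linear transform
    agg = (h[nbrs] - h[:, None, :]).max(axis=1)    # max-relative aggregation
    h = np.concatenate([h, agg], axis=-1)          # (N, 2D)
    h = relu(h @ Wout) + X                         # post-GC linear + residual
    # FFN module with residual
    return relu(h @ W1) @ W2 + h

rng = np.random.default_rng(0)
N, D = 8, 16
X = rng.standard_normal((N, D))
nbrs = rng.integers(0, N, size=(N, 3))
out = vig_block(X, nbrs,
                Win=rng.standard_normal((D, D)),
                Wout=rng.standard_normal((2 * D, D)),
                W1=rng.standard_normal((D, 4 * D)),
                W2=rng.standard_normal((4 * D, D)))
print(out.shape)  # (8, 16)
```

The two residual connections are what keep node features from collapsing into one another as blocks are stacked.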
Network Architecture
Isotropic architecture
Models like ViT or ResMLP, where the features keep the same size throughout the network.

Pyramid architecture
Feature maps shrink progressively, as in ResNet and PVT.

PE
absolute PE is added to the node features
Result
Experiment detail

Result for isotropic

Result for Pyramid

Object Detection result

visualization

etc
- max relative graph convolution
Proposed in DeepGCNs: Can GCNs Go as Deep as CNNs? (https://arxiv.org/pdf/1904.03751.pdf). Because of the over-smoothing problem above, most GCN models were 4 layers or fewer; similar in spirit to ResNet, this paper asks how to make GCNs deeper.
residual / dense connection

dilation
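The dilation idea from DeepGCNs might be sketched as: take the $k \cdot d$ nearest neighbors, then keep every $d$-th one (a sketch; function name and shapes are my own):

```python
import numpy as np

def dilated_knn(X, k, d):
    """Dilated KNN per DeepGCNs: from the k*d nearest neighbors,
    keep every d-th one, enlarging the receptive field without
    adding more edges per node."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    order = np.argsort(d2, axis=1)[:, : k * d]  # k*d closest, nearest first
    return order[:, ::d]                        # (N, k) dilated neighbors

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 4))
nbrs = dilated_knn(X, k=3, d=2)
print(nbrs.shape)  # (10, 3)
```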
