
paper, code

TL;DR

  • why I read it : NeurIPS 2023
  • task : image classification, object detection
  • problem : CNNs use a rigid sliding window and ViT cuts the image into patches fed in as a fixed sequence; something more flexible is wanted.
  • idea : split the image into patches, treat each patch as a graph node, and process the graph with a GNN
  • architecture : multi-head max-relative GCN + linear + BN + ReLU + linear + BN, plus an FFN, stacked over multiple layers.
  • baseline : ResNet, CycleMLP, Swin-T
  • data : ImageNet ILSVRC 2012, COCO2017
  • RESULT : SOTA at similar FLOPs in the tiny-model comparison.
  • contribution : first GNN backbone for general image representation
  • limitation/things I cannot understand : I'm not sure what the advantage is over ViT; both cut the image into patches and feed them in, and self-attention already gives full connectivity between patches... hmm.

Details

motivation


Preliminaries

https://ydy8989.github.io/2021-03-03-GAT/

Architecture

Graph Structure of Image

Divide the $H \times W \times 3$ image into $N$ patches. Represent each patch as a feature vector $\mathrm{x}_i \in \mathbb{R}^D$ to get $X=[\mathrm{x}_1, \mathrm{x}_2, \dots, \mathrm{x}_N]$. These features can be viewed as a set of unordered nodes $\mathcal{V}=\{v_1, v_2, \dots, v_N\}$. For each node $v_i$, find its $K$ nearest neighbors $\mathcal{N}(v_i)$ and add an edge $e_{ij}$ for each neighbor $v_j$. This yields the graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$, which we can pass through a GNN! We write the graph construction as $\mathcal{G}=G(X)$.
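The construction above is just a K-nearest-neighbor graph over the patch features. A minimal sketch (my own helper names; the paper builds this graph on learned features inside the network, and on a GPU one would use a library routine instead):

```python
import numpy as np

def build_knn_graph(x, k=9):
    """G = G(X): connect each of the N patch features to its K
    nearest neighbors in feature space.

    x: (N, D) array of patch features.
    Returns edge_index: (2, N*k) array of (center, neighbor) pairs.
    """
    # pairwise squared Euclidean distances, shape (N, N)
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)           # exclude self-loops
    nbrs = np.argsort(d2, axis=1)[:, :k]   # (N, k) nearest neighbors
    src = np.repeat(np.arange(len(x)), k)  # each center repeated k times
    return np.stack([src, nbrs.reshape(-1)])

# toy example: 196 patches (a 14x14 grid) with D=64 features
X = np.random.randn(196, 64)
edges = build_knn_graph(X, k=9)
print(edges.shape)  # (2, 1764)
```

Note the graph is directed here (each node points to its own K neighbors); K and the distance metric are design choices.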

  • Advantages of the graph representation
  1. A graph is a very general structure: the grid in a CNN or the sequence in a ViT can each be seen as a special case of a graph.
  2. Graphs may have advantages over grids and sequences for representing complex objects with variable shapes.
  3. An object can be viewed as a composition of parts (head, torso, arms, legs in the case of a person), so a graph may be better at modeling how those parts combine.
  4. The latest GNN architectures can be leveraged.

Graph-level processing


Starting from the feature matrix $X \in \mathbb{R}^{N\times D}$, build the graph $\mathcal{G}=G(X)$. A graph convolutional layer then exchanges information by aggregating features between neighboring nodes.

More concretely, neighbor information is aggregated for each node $x_i$ to produce $x_i'$. The paper uses max-relative graph convolution, which aggregates by taking the element-wise max of the feature differences, roughly $x_i' = \max(\{x_j - x_i \mid j \in \mathcal{N}(x_i)\})$, followed by an update step.

The update is additionally written as a multi-head operation, introduced for feature diversity: the feature is split into heads, each head is updated with its own weights, and the heads are concatenated back together.
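A sketch of max-relative aggregation plus a multi-head update, under my reading of the paper (function and variable names are my own; the real layers are learned end-to-end with BN etc.):

```python
import numpy as np

def max_relative_conv(x, edge_index, W, heads=4):
    """Max-relative graph convolution with a multi-head update (sketch).

    Aggregate: x_i' = elementwise max over neighbors j of (x_j - x_i).
    Update: concat [x_i, x_i'], split into `heads` chunks, transform
    each chunk with its own weight matrix, concat the heads back.
    """
    n, d = x.shape
    src, dst = edge_index            # edge from center src to neighbor dst
    agg = np.full((n, d), -np.inf)
    diff = x[dst] - x[src]           # relative features, one row per edge
    np.maximum.at(agg, src, diff)    # elementwise max over each node's edges
    agg[np.isinf(agg)] = 0.0         # nodes with no neighbors: zero aggregate
    y = np.concatenate([x, agg], axis=1)        # (N, 2D)
    chunks = np.split(y, heads, axis=1)         # multi-head split
    out = [c @ Wh for c, Wh in zip(chunks, W)]  # per-head linear update
    return np.concatenate(out, axis=1)

# tiny fully connected 3-node graph, D=8, 4 heads
edge_index = np.array([[0, 0, 1, 1, 2, 2],
                       [1, 2, 0, 2, 0, 1]])
x = np.random.randn(3, 8)
W = [np.random.randn(4, 4) for _ in range(4)]  # each head maps 2D/heads -> d_out/heads
out = max_relative_conv(x, edge_index, W, heads=4)
print(out.shape)  # (3, 16)
```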

ViG block

With plain graph convolutions, the differences between node features wash out as graph convolution layers are stacked (over-smoothing), which degrades performance.

So the ViG block adds more feature transformation and nonlinear activation: a linear layer is placed around the GCN layer together with a nonlinear activation, and an FFN module is added on top.

This gives better feature diversity than ResGCN (Figure 3 in the paper).
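Putting the pieces together, a ViG block is roughly "Grapher + FFN", each with a residual connection. A simplified sketch (BatchNorm omitted, biases omitted, plain weight matrices; `gconv` stands for any graph convolution such as the max-relative one above):

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0)

def vig_block(x, gconv, W_in, W_out, W_ffn1, W_ffn2):
    """One ViG block (simplified sketch).

    Grapher: linear -> graph conv -> activation -> linear, + residual
    FFN:     linear -> activation -> linear,              + residual
    """
    h = relu(gconv(x @ W_in)) @ W_out   # Grapher module
    x = x + h                           # residual connection
    h = relu(x @ W_ffn1) @ W_ffn2       # FFN for extra feature transform
    return x + h                        # residual connection

# shape check with an identity "graph conv" stand-in
x = np.random.randn(10, 16)
W_in, W_out = np.random.randn(16, 16), np.random.randn(16, 16)
W_ffn1, W_ffn2 = np.random.randn(16, 64), np.random.randn(64, 16)
y = vig_block(x, lambda z: z, W_in, W_out, W_ffn1, W_ffn2)
print(y.shape)  # (10, 16)
```

The extra linear layers and FFN are exactly what the paper adds on top of a bare GCN to fight over-smoothing.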

Network Architecture

Isotropic architecture

Models like ViT or ResMLP where the features keep the same size and shape throughout the network.

Pyramid architecture

The spatial resolution of the features shrinks progressively through the stages, as in ResNet and PVT.


PE

An absolute positional embedding is added to each node feature.
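That is, before graph processing, each patch feature gets its own learnable position vector added to it, for example:

```python
import numpy as np

# Absolute positional embedding (sketch): one D-dim vector per node,
# added to the patch features. In the real model `pos` is learnable;
# random values here are just for illustration.
N, D = 196, 64
x = np.random.randn(N, D)    # patch features
pos = np.random.randn(N, D)  # positional embedding table
x = x + pos
print(x.shape)  # (196, 64)
```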

Result

Experiment detail


Result for isotropic


Result for Pyramid


Object Detection result


visualization


etc

  • max relative graph convolution

Proposed in DeepGCNs: Can GCNs Go as Deep as CNNs? (https://arxiv.org/pdf/1904.03751.pdf). Because of the over-smoothing described above, most GCN models had been 4 layers or fewer; this paper asks how to make GCNs deeper, borrowing ideas in the spirit of ResNet:

  1. residual / dense connections

  2. dilation (dilated neighbor aggregation)