
paper

TL;DR

  • I read this because.. : it is cited in MaskCLIP, which seems to build on this idea, and it looks like a classic.
  • task : image classification
  • problem : ImageNet images are annotated with a single class, but many actually contain multiple objects — especially problematic under random-crop augmentation.
  • idea : relabel pixel-wise to multi-label with a powerful image classifier trained with extra data.
  • input/output : (teacher) image -> pixel wise multi label. (student) image -> class
  • architecture : ResNet / EfficientNet-L2. (teacher) discard GAP and reuse the final linear layer as a 1x1-conv pixel-wise classifier (no extra training)
  • objective : cross-entropy loss. (student) random-crop, RoI-align the corresponding region of the teacher's label map, then softmax it into a soft supervision target.
  • baseline : learn with one-hot ImageNet labels / label smoothing / label cleaning
  • data : ImageNet / teacher is pretrained at super-ImageNet scale (JFT-300M or Instagram-1B) and fine-tuned on ImageNet
  • evaluation : accuracy
  • result : Improved performance, especially when used with CutMix
  • contribution : ImageNet labels are problematic and there have been prior attempts to improve them; this works better and is more efficient, since the label maps are pre-computed once rather than generated on the fly as in KD.
  • etc. :

Details

motivation


With random cropping, only 23.5% of crops have IoU ≥ 0.5 with the actual object…
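That mismatch can be quantified per crop with a plain IoU check (a minimal sketch; the box format and example boxes are my own, not from the paper):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# a crop covering the left half vs. an object box on the right half:
# the crop misses the object entirely, yet keeps the single-class label
print(iou((0, 0, 112, 224), (112, 0, 224, 224)))  # 0.0
```

A crop like this still gets supervised with the full image's one-hot label, which is exactly the noise the paper targets.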


Re-Labeling ImageNet

Fine-tune a classifier pretrained at super-ImageNet scale (JFT-300M / Instagram-1B) on ImageNet. Such a strong model has a propensity to predict multiple labels even though it was trained with single (noisy) labels and cross-entropy: for an image X whose true labels are both 0 and 1, the CE-optimal prediction is (1/2, 1/2).
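A toy check of that (1/2, 1/2) claim (my own two-class example, not from the paper): if the same image appears with target 0 half the time and target 1 half the time, the average cross-entropy is minimized by the uniform prediction.

```python
import math

def avg_ce(p):
    """Average cross-entropy over the two conflicting one-hot targets.

    p is the predicted probability of class 0 in a two-class toy setup;
    the annotation says class 0 half the time and class 1 half the time.
    """
    return 0.5 * -math.log(p) + 0.5 * -math.log(1 - p)

# sweep p on a grid and find the minimizer: it is p = 0.5, i.e. (1/2, 1/2)
best = min((p / 100 for p in range(1, 100)), key=avg_ce)
print(best)  # 0.5
```

So a noisy single-label dataset can still teach a strong model soft multi-label behavior, which is what makes the re-labeling teacher possible.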

If we drop global pooling and reuse the last classifier as a 1x1 conv over the w x h feature map, we get a classifier for each pixel! (Don't overthink the 1x1 conv: instead of w x h x d -> (GAP) 1 x d -> 1 x C, it is simply w x h x d -> w x h x C.) For related work, see Fully Convolutional Networks for Semantic Segmentation / CAM!
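A numpy sketch of that reuse (toy shapes and names are mine): applying the FC weights at every spatial position is a 1x1 conv, and by linearity, averaging the pixel-wise logits recovers the original GAP-then-FC prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w, d, c = 4, 4, 8, 10           # spatial size, feature dim, num classes
feat = rng.normal(size=(h, w, d))  # backbone feature map before pooling
W = rng.normal(size=(d, c))        # the original FC classifier weights
b = rng.normal(size=(c,))

# original head: GAP then FC -> one prediction per image
gap_logits = feat.mean(axis=(0, 1)) @ W + b           # shape (c,)

# reused head: same W applied as a 1x1 conv -> one prediction per pixel
pixel_logits = np.einsum('hwd,dc->hwc', feat, W) + b  # shape (h, w, c)

# averaging the pixel-wise logits recovers the GAP prediction (linearity)
assert np.allclose(pixel_logits.mean(axis=(0, 1)), gap_logits)
print(pixel_logits.shape)  # (4, 4, 10)
```

No retraining is needed: the same weights that classified the whole image now produce the w x h x C label map.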

How to use it for actual training

Save the label maps for ImageNet in advance. When an image is cropped, RoI-align the corresponding region of the label map -> softmax the pooled scores and use the result as a soft label (classes covering more of the crop get higher weight).
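A simplified sketch of that step (my own toy interface; a plain integer crop plus average pooling stands in for the bilinear-sampling RoIAlign used in practice):

```python
import numpy as np

def crop_soft_label(label_map, x1, y1, x2, y2):
    """Crop a precomputed (H, W, C) label map, average-pool the region,
    and softmax it into a soft target for the student's CE loss.
    Integer-pixel crop is a stand-in for true RoIAlign."""
    region = label_map[y1:y2, x1:x2]   # (h', w', C) logits inside the crop
    pooled = region.mean(axis=(0, 1))  # average over the cropped area
    e = np.exp(pooled - pooled.max())  # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
label_map = rng.normal(size=(14, 14, 5))       # toy teacher label map
soft = crop_soft_label(label_map, 0, 0, 7, 7)  # top-left crop
print(soft.sum())  # sums to ~1.0: a valid soft label
```

Because the label maps are written to disk once, training only pays for this cheap crop-and-pool, not for a teacher forward pass per batch.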

Results


They tried other tasks with the trained backbone, and it beats the standard ImageNet-pretrained one.

Ablations


The two key ingredients are 1) multi-label and 2) localization; the ablations remove one at a time (re-applying GAP to drop localization, or taking argmax to drop multi-label). Both turn out to be major performance factors.