TL;DR
- I read this because.. : it was mentioned elsewhere; MaskCLIP seems to have borrowed this idea, and it seems to be a classic.
- task : image classification
- problem : ImageNet images are labeled with a single class, but many actually contain multiple objects, which is especially problematic under random cropping.
- idea : relabel images pixel-wise with multi-labels using a powerful image classifier trained with extra data.
- input/output : (teacher) image -> pixel wise multi label. (student) image -> class
- architecture : ResNet / EfficientNet-L2. (teacher) discard GAP and reuse the final linear layer as a 1x1 conv classifier (no additional training).
- objective : cross-entropy loss. (student) crop, RoI-align the corresponding region of the teacher's label map, then softmax it into a soft label used as supervision.
- baseline : training with one-hot ImageNet labels / label smoothing / label cleaning
- data : ImageNet / teacher is pretrained at super-ImageNet scale (JFT-300M or InstagramNet-1B), then finetuned on ImageNet
- evaluation : accuracy
- result : Improved performance, especially when used with CutMix
- contribution : ImageNet labels are problematic and there have been prior attempts to improve them, but this method beats them and is more efficient than KD because the label maps (LabelPool) are precomputed.
- etc. :
Details
motivation
With random crops, only 23.5% of crops have IoU above 0.5 with the actual object…
Re-Labeling ImageNet
Take a model trained at super-ImageNet scale (JFT-300M / InstagramNet-1B) and finetune it on ImageNet -> such a model has a propensity to predict multi-labels even though the data carries single (noisy) labels, because cross-entropy averages over the noise. For example, if image X appears with label 0 half the time and label 1 the other half, the CE-optimal prediction is (1/2, 1/2).
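A tiny numerical check of that claim (the grid search below is my own sketch, not from the paper): minimizing the expected cross-entropy over the two conflicting labels lands at p = (1/2, 1/2).

```python
import numpy as np

# Image X is seen with label 0 half the time and label 1 the other half.
# Expected CE over a 2-class prediction (p0, 1 - p0):
#   0.5 * (-log p0) + 0.5 * (-log (1 - p0))
p0 = np.linspace(0.01, 0.99, 99)                      # candidate values for p0
loss = 0.5 * (-np.log(p0)) + 0.5 * (-np.log(1 - p0))  # expected CE for each candidate
best = p0[loss.argmin()]                              # minimizer ≈ 0.5
```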
If we drop global pooling and apply the last classifier as a 1x1 conv over the w x h feature map, we get a classifier for each pixel! (Don't overthink the 1x1 conv: instead of w x h x d -> (GAP) 1 x d -> 1 x C, it is simply w x h x d -> w x h x C.) For related work, see Fully Convolutional Networks for Semantic Segmentation / CAM!
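A minimal PyTorch sketch of this conversion (shapes and names are my own; d=2048 is a ResNet-50-style feature dim, and the weights here are random stand-ins for the pretrained classifier):

```python
import torch
import torch.nn as nn

d, C = 2048, 1000
linear = nn.Linear(d, C)  # the pretrained GAP-then-linear classifier head

# Reuse the exact same weights as a 1x1 convolution: w x h x d -> w x h x C.
conv1x1 = nn.Conv2d(d, C, kernel_size=1)
conv1x1.weight.data = linear.weight.data.view(C, d, 1, 1).clone()
conv1x1.bias.data = linear.bias.data.clone()

feat = torch.randn(1, d, 7, 7)   # backbone feature map for one image
pixel_logits = conv1x1(feat)     # (1, C, 7, 7): class scores at every location

# Sanity check: by linearity, GAP-then-linear equals conv-then-GAP.
gap_then_linear = linear(feat.mean(dim=(2, 3)))
conv_then_gap = pixel_logits.mean(dim=(2, 3))
```

No retraining is needed because the two heads compute the same function up to the order of pooling.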
How to use it in actual training
Save the label maps for ImageNet in advance. When an image is cropped, RoI-align the corresponding part of the label map -> take the softmax of the pooled scores and use the result as a soft label (classes covering a larger area get higher weight?)
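A dependency-free sketch of that step, assuming a precomputed label map of shape (C, H, W) per image. The paper uses RoI align (e.g. torchvision.ops.roi_align) for sub-pixel crop boxes; to keep this self-contained I approximate it with integer slicing plus average pooling:

```python
import torch

C, H, W = 1000, 15, 15
label_map = torch.randn(C, H, W)  # stand-in for a saved teacher label map

# The random-crop box, expressed in label-map coordinates (integers here
# for simplicity; RoI align would handle fractional boxes).
x1, y1, x2, y2 = 2, 3, 10, 12
region = label_map[:, y1:y2, x1:x2]  # (C, crop_h, crop_w) scores inside the crop
pooled = region.mean(dim=(1, 2))     # (C,) average class score over the crop
soft_label = pooled.softmax(dim=0)   # soft multi-label target, sums to 1

# The student is then supervised with cross-entropy against this soft label:
# loss = -(soft_label * student_log_probs).sum()
```

Because the label maps are computed once offline, this adds almost no training-time cost compared to running a teacher per batch as in KD.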
Results
Transferring the trained backbone to other tasks also beats the vanilla ImageNet-pretrained backbone.
Ablations
The two key ingredients are 1) multi-label and 2) localization; ablate them by collapsing the soft label with argmax, or by re-applying GAP. Both were major performance factors.