TL;DR
- I read this because.. : it was mentioned elsewhere; MaskCLIP seems to have borrowed this idea, and it seems to be a classic.
- task : image classification
- problem : ImageNet images are labeled with a single class, but many actually contain multiple objects, which is especially problematic under random cropping.
- idea : relabel images pixel-wise with multi-labels using a powerful image classifier trained with extra data.
- input/output : (teacher) image -> pixel wise multi label. (student) image -> class
- architecture : ResNet / EfficientNet-L2. (teacher) discard GAP and reuse the final linear layer as a 1x1 conv classifier (no additional training).
- objective : cross-entropy loss. (student) crop, RoI-align the corresponding region of the teacher's label map, then softmax it into a soft label used as supervision.
- baseline : training with one-hot ImageNet labels / label smoothing / label cleaning
- data : ImageNet / teacher is pretrained at super-ImageNet scale (JFT-300M or InstagramNet-1B), then finetuned on ImageNet
- evaluation : accuracy
- result : Improved performance, especially when used with CutMix
- contribution : ImageNet labels are problematic and there have been prior attempts to improve them, but this method beats them and is more efficient than KD because the label maps (LabelPool) are precomputed.
- etc. :
Details
motivation
With random crops, only 23.5% of crops have IoU above 0.5 with the actual object…
Re-Labeling ImageNet
Take a model trained at super-ImageNet scale (JFT-300M / InstagramNet-1B) and finetune it on ImageNet -> such a model has a propensity to predict multi-labels even though the data carries single (noisy) labels, because cross-entropy averages over the noise. For example, if image X appears with label 0 half the time and label 1 the other half, the CE-optimal prediction is (1/2, 1/2).
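A tiny numerical check of that claim (the grid search below is my own sketch, not from the paper): minimizing the expected cross-entropy over the two conflicting labels lands at p = (1/2, 1/2).

```python
import numpy as np

# Image X is seen with label 0 half the time and label 1 the other half.
# Expected CE over a 2-class prediction (p0, 1 - p0):
#   0.5 * (-log p0) + 0.5 * (-log (1 - p0))
p0 = np.linspace(0.01, 0.99, 99)                      # candidate values for p0
loss = 0.5 * (-np.log(p0)) + 0.5 * (-np.log(1 - p0))  # expected CE for each candidate
best = p0[loss.argmin()]                              # minimizer ≈ 0.5
```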
If we drop global pooling and apply the last classifier as a 1x1 conv over the w x h feature map, we get a classifier for each pixel! (Don't overthink the 1x1 conv: instead of w x h x d -> (GAP) 1 x d -> 1 x C, it is simply w x h x d -> w x h x C.) For related work, see Fully Convolutional Networks for Semantic Segmentation / CAM!
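A minimal PyTorch sketch of this conversion (shapes and names are my own; d=2048 is a ResNet-50-style feature dim, and the weights here are random stand-ins for the pretrained classifier):

```python
import torch
import torch.nn as nn

d, C = 2048, 1000
linear = nn.Linear(d, C)  # the pretrained GAP-then-linear classifier head

# Reuse the exact same weights as a 1x1 convolution: w x h x d -> w x h x C.
conv1x1 = nn.Conv2d(d, C, kernel_size=1)
conv1x1.weight.data = linear.weight.data.view(C, d, 1, 1).clone()
conv1x1.bias.data = linear.bias.data.clone()

feat = torch.randn(1, d, 7, 7)   # backbone feature map for one image
pixel_logits = conv1x1(feat)     # (1, C, 7, 7): class scores at every location

# Sanity check: by linearity, GAP-then-linear equals conv-then-GAP.
gap_then_linear = linear(feat.mean(dim=(2, 3)))
conv_then_gap = pixel_logits.mean(dim=(2, 3))
```

No retraining is needed because the two heads compute the same function up to the order of pooling.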
How to use it in actual training
Save the label maps for ImageNet in advance. When an image is cropped, RoI-align the corresponding part of the label map -> take the softmax of the pooled scores and use the result as a soft label (classes covering a larger area get higher weight?)
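A dependency-free sketch of that step, assuming a precomputed label map of shape (C, H, W) per image. The paper uses RoI align (e.g. torchvision.ops.roi_align) for sub-pixel crop boxes; to keep this self-contained I approximate it with integer slicing plus average pooling:

```python
import torch

C, H, W = 1000, 15, 15
label_map = torch.randn(C, H, W)  # stand-in for a saved teacher label map

# The random-crop box, expressed in label-map coordinates (integers here
# for simplicity; RoI align would handle fractional boxes).
x1, y1, x2, y2 = 2, 3, 10, 12
region = label_map[:, y1:y2, x1:x2]  # (C, crop_h, crop_w) scores inside the crop
pooled = region.mean(dim=(1, 2))     # (C,) average class score over the crop
soft_label = pooled.softmax(dim=0)   # soft multi-label target, sums to 1

# The student is then supervised with cross-entropy against this soft label:
# loss = -(soft_label * student_log_probs).sum()
```

Because the label maps are computed once offline, this adds almost no training-time cost compared to running a teacher per batch as in KD.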
Results
Transferring the trained backbone to other tasks also beats the vanilla ImageNet-pretrained backbone.
Ablations
The two key ingredients are 1) multi-label and 2) localization; ablate them by collapsing the soft label with argmax, or by re-applying GAP. Both were major performance factors.