TL;DR
- I read this because… : Sunghyun recommended it. It also seems to be used for region caption / detailed caption generation, and I was curious about fine-grained CLIP.
- task : CLIP with mask
- Problem : CLIP pulls information globally, but I want finer-grained understanding. How can the model focus on a region while still seeing the full context of the image and without distorting the image itself?
- idea : attach a conv stem before the ViT in CLIP — an RGB conv and an alpha conv separately — then merge the features and pass them to the ViT
- input/output : (clip) image + mask, text -> similarity
- architecture : CLIP
- objective : contrastive loss
- baseline : (image classification) CLIP, Red Circle, MaskCLIP; (REC) CPT, ReCLIP, Red Circle; (OVD) MaskImageNet, Detic-ImageNet; (MLLM) LLaVA-1.5, BLIP-2, …
- data : RGBA region–text pairs generated with an additional pipeline over GRIT-20m + ImageNet 460K
- evaluation : For each benchmark… Sometimes the MLLM just swaps the backbone (possible because the text encoder is frozen), sometimes it's finetuned
- Result : Improved imagenet performance, reduced MLLM hallucinations, etc.
- contribution : Simple architecture that seems to tackle a lot of problems without much training.
- etc. : What do we do with SAM? Region level clips do reduce hallucinations.
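The objective in the TL;DR is the standard CLIP contrastive loss: symmetric cross-entropy over cosine-similarity logits between matched image/text pairs in a batch. A minimal numpy sketch (function name and temperature value are my assumptions, not the paper's code):

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # img_emb, txt_emb: (N, D); row i of each is a matched pair.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # (N, N) cosine sims
    # Symmetric cross-entropy: match image i to text i and vice versa.
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)     # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))
    return (xent(logits) + xent(logits.T)) / 2
```

Perfectly aligned pairs drive the diagonal similarities up relative to the off-diagonals, which is exactly what lets the frozen text encoder keep working when only the image side changes.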
Details
motivation
- image recognition: better classification (imagenet is single-label but actually multi-label) / can be used as referring expression comprehension (REC) / can be used for data generation for OVDs
- The backbone of MLLM: reducing hallucination and model bias
- generation: allows you to cherry-pick the parts you want and fixes problems with multi objects
Region-focusing strategy
Prior region-focusing strategies either distort the image itself (e.g. drawing a red circle on it) or omit/delete the full contextual information (e.g. cropping or masking out the background)
RGBA Region-Text Pair Generation
Alpha-CLIP
- text encoder freeze
- Add RGB conv + alpha conv
- alpha is between 0 and 1, but we want the alpha branch to contribute nothing at first, so the alpha conv is zero-initialized (training starts from vanilla CLIP behavior)
- The alpha conv and RGB conv outputs are merged by elementwise summation (not shown in the figure)
- 10% of samples per epoch are trained as plain image–text pairs (whole-image alpha) to preserve global recognition
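The RGB-conv + alpha-conv + elementwise-sum stem can be sketched in plain numpy: a stride=patch conv over non-overlapping patches is just a reshape + matmul. A minimal sketch (names and shapes are my assumptions, not the paper's code); with a zero-initialized alpha kernel, the output is identical to the original RGB-only patch embedding, which is the point of the initialization:

```python
import numpy as np

def patch_embed(img, alpha, w_rgb, w_alpha, patch=16):
    # img: (3, H, W); alpha: (1, H, W) in [0, 1]
    # w_rgb: (D, 3, patch, patch); w_alpha: (D, 1, patch, patch)
    D = w_rgb.shape[0]
    _, H, W = img.shape
    gh, gw = H // patch, W // patch

    def conv(x, w):
        # Non-overlapping stride=patch conv == split into patches + linear map.
        c = x.shape[0]
        p = x.reshape(c, gh, patch, gw, patch).transpose(1, 3, 0, 2, 4)
        p = p.reshape(gh * gw, c * patch * patch)
        return p @ w.reshape(D, -1).T            # (num_patches, D)

    # Alpha-CLIP idea: a separate conv for the alpha channel,
    # merged with the RGB patch embedding by elementwise summation.
    return conv(img, w_rgb) + conv(alpha, w_alpha)
```

Because the two branches are summed, zero-initializing `w_alpha` makes the model start exactly at the pretrained CLIP weights, and the alpha pathway only gradually learns to modulate attention toward the masked region.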
Result
- ImageNet classification
ImageNet-S adds semantic segmentation annotations on top of ImageNet; performance improves when that GT mask is given as alpha, and also when it's given as a bbox. With the whole image as the mask there's no performance drop (as if it's just learning from new data)
- REC
The pipeline is kinda weird, lol… SAM proposes a bunch of masks, and the one whose Alpha-CLIP embedding is closest to the text is taken as the answer.
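That SAM-then-score pipeline reduces to an argmax over mask candidates. A minimal sketch with a hypothetical `encode_region` callable standing in for Alpha-CLIP (all names are mine, and the mask proposal step is assumed done by SAM):

```python
import numpy as np

def rec_with_sam_masks(image, masks, text_emb, encode_region):
    # masks: candidate region masks proposed by SAM (assumed given).
    # encode_region(image, mask) -> (D,) image embedding, with the
    #   mask fed in as Alpha-CLIP's alpha channel.
    # Pick the mask whose region embedding is closest to the text.
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = [cos(encode_region(image, m), text_emb) for m in masks]
    return int(np.argmax(scores)), scores
```

The "weird" part the note is reacting to: REC accuracy then hinges on SAM's proposal quality as much as on Alpha-CLIP's scoring.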
- OVD
Using a pseudo-labeling approach + AlphaCLIP as a backbone yields better performance
- Region-level captioning
Just swapping in the backbone already works.
Quantitative results after finetuning are also reported.