TL;DR
- I read this because… : Sunghyun recommended it. It also seems to be used for region caption / detailed caption generation, and I was curious about fine-grained CLIP.
- task : CLIP with mask
- Problem : CLIP pulls information globally, but I want finer-grained understanding. How can the model focus on a region while still seeing the full context of the image and without distorting the image itself?
- idea : attach a conv stem before the ViT in CLIP — an RGB conv and an alpha conv separately — then merge the features and pass them to the ViT
- input/output : (clip) image + mask, text -> similarity
- architecture : CLIP
- objective : contrastive loss
- baseline : (image classification) CLIP, Red Circle, MaskCLIP; (REC) CPT, ReCLIP, Red Circle; (OVD) MaskImageNet, Detic-ImageNet; (MLLM) LLaVA-1.5, BLIP-2, …
- data : RGBA region–text pairs generated with an additional pipeline over GRIT-20m + ImageNet 460K
- evaluation : For each benchmark… Sometimes the MLLM just swaps the backbone (possible because the text encoder is frozen), sometimes it's finetuned
- Result : Improved imagenet performance, reduced MLLM hallucinations, etc.
- contribution : Simple architecture that seems to tackle a lot of problems without much training.
- etc. : What do we do with SAM? Region level clips do reduce hallucinations.
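The objective in the TL;DR is the standard CLIP contrastive loss: symmetric cross-entropy over cosine-similarity logits between matched image/text pairs in a batch. A minimal numpy sketch (function name and temperature value are my assumptions, not the paper's code):

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # img_emb, txt_emb: (N, D); row i of each is a matched pair.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # (N, N) cosine sims
    # Symmetric cross-entropy: match image i to text i and vice versa.
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)     # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))
    return (xent(logits) + xent(logits.T)) / 2
```

Perfectly aligned pairs drive the diagonal similarities up relative to the off-diagonals, which is exactly what lets the frozen text encoder keep working when only the image side changes.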
Details
motivation
- image recognition: better classification (imagenet is single-label but actually multi-label) / can be used as referring expression comprehension (REC) / can be used for data generation for OVDs
- The backbone of MLLM: reducing hallucination and model bias
- generation: allows you to cherry-pick the parts you want and fixes problems with multi objects
Region-focusing strategy
Prior region-focusing strategies either distort the image itself (e.g. drawing a red circle on it) or omit/delete the full contextual information (e.g. cropping or masking out the background)
RGBA Region-Text Pair Generation
Alpha-CLIP
- text encoder freeze
- Add RGB conv + alpha conv
- alpha is between 0 and 1, but we want the alpha branch to contribute nothing at first, so the alpha conv is zero-initialized (training starts from vanilla CLIP behavior)
- The alpha conv and RGB conv outputs are merged by elementwise summation (not shown in the figure)
- 10% of samples per epoch are trained as plain image–text pairs (whole-image alpha) to preserve global recognition
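The RGB-conv + alpha-conv + elementwise-sum stem can be sketched in plain numpy: a stride=patch conv over non-overlapping patches is just a reshape + matmul. A minimal sketch (names and shapes are my assumptions, not the paper's code); with a zero-initialized alpha kernel, the output is identical to the original RGB-only patch embedding, which is the point of the initialization:

```python
import numpy as np

def patch_embed(img, alpha, w_rgb, w_alpha, patch=16):
    # img: (3, H, W); alpha: (1, H, W) in [0, 1]
    # w_rgb: (D, 3, patch, patch); w_alpha: (D, 1, patch, patch)
    D = w_rgb.shape[0]
    _, H, W = img.shape
    gh, gw = H // patch, W // patch

    def conv(x, w):
        # Non-overlapping stride=patch conv == split into patches + linear map.
        c = x.shape[0]
        p = x.reshape(c, gh, patch, gw, patch).transpose(1, 3, 0, 2, 4)
        p = p.reshape(gh * gw, c * patch * patch)
        return p @ w.reshape(D, -1).T            # (num_patches, D)

    # Alpha-CLIP idea: a separate conv for the alpha channel,
    # merged with the RGB patch embedding by elementwise summation.
    return conv(img, w_rgb) + conv(alpha, w_alpha)
```

Because the two branches are summed, zero-initializing `w_alpha` makes the model start exactly at the pretrained CLIP weights, and the alpha pathway only gradually learns to modulate attention toward the masked region.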
Result
- ImageNet classification
ImageNet-S adds semantic segmentation annotations on top of ImageNet; performance improves when that GT mask is given as alpha, and also when it's given as a bbox. With the whole image as the mask there's no performance drop (as if it's just learning from new data)
- REC
The pipeline is kinda weird, lol… SAM proposes a bunch of masks, and the one whose Alpha-CLIP embedding is closest to the text is taken as the answer.
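That SAM-then-score pipeline reduces to an argmax over mask candidates. A minimal sketch with a hypothetical `encode_region` callable standing in for Alpha-CLIP (all names are mine, and the mask proposal step is assumed done by SAM):

```python
import numpy as np

def rec_with_sam_masks(image, masks, text_emb, encode_region):
    # masks: candidate region masks proposed by SAM (assumed given).
    # encode_region(image, mask) -> (D,) image embedding, with the
    #   mask fed in as Alpha-CLIP's alpha channel.
    # Pick the mask whose region embedding is closest to the text.
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = [cos(encode_region(image, m), text_emb) for m in masks]
    return int(np.argmax(scores)), scores
```

The "weird" part the note is reacting to: REC accuracy then hinges on SAM's proposal quality as much as on Alpha-CLIP's scoring.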
- OVD
Using a pseudo-labeling approach + AlphaCLIP as a backbone yields better performance
- Region-level captioning
Just swapping in the backbone already works.
Quantitative results after finetuning are also reported.