image

paper, page, demo

TL;DR

  • Why I read this : Sunghyun recommended it. It also seems to be used for generating region captions / detailed captions, and I was curious about fine-grained CLIP.
  • task : CLIP with mask
  • Problem : CLIP pulls information globally, but I want a finer-grained understanding. How can the model focus on a region while keeping the full context of the image and without distorting the image itself?
  • idea : attach conv operations before the ViT in CLIP, an RGB conv and an alpha conv separately; sum their features and pass the result to the ViT
  • input/output : (clip) image + mask, text -> similarity
  • architecture : CLIP
  • objective : contrastive loss
  • baseline : (image classification) CLIP, Red Circle, MaskCLIP; (REC) CPT, ReCLIP, Red Circle; (OVD) MaskImageNet, Detic-ImageNet; (MLLM) LLaVA-1.5, BLIP-2, …
  • data : RGBA region-text pairs generated with an additional pipeline on GRIT-20m + ImageNet 460K
  • evaluation : varies per benchmark… Sometimes the MLLM just swaps in the backbone (possible because the text encoder is frozen), sometimes it's finetuned
  • Result : improved ImageNet performance, reduced MLLM hallucinations, etc.
  • contribution : simple architecture that seems to tackle a lot of problems without much extra training.
  • etc. : What do we do with SAM? Region-level CLIPs do reduce hallucinations.
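The objective above is the standard CLIP contrastive loss (symmetric cross-entropy over cosine similarities between paired image and text embeddings). A minimal numpy sketch for reference; the function and argument names are my own, not from the paper:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (N, D) arrays; row i of each is a matched pair.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature            # (N, N) similarity matrix

    def xent(lg):
        # cross-entropy with targets on the diagonal (matched pairs)
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # average of image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Matched pairs sitting on the diagonal should yield a lower loss than mismatched ones.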

Details

motivation

image
  • image recognition: better classification (ImageNet is labeled single-label but is actually multi-label) / can be used for referring expression comprehension (REC) / can be used to generate data for OVD
  • as the backbone of an MLLM: reduces hallucination and model bias
  • generation: lets you pick out the parts you want and fixes problems with multiple objects

Region-focusing strategy

image

With prior region-focusing strategies, either the image itself is distorted or the full contextual information is omitted/deleted.

RGBA Region-Text Pair Generation

image
  • Grounding data pipeline: GRiT data already has bounding boxes and region texts; run SAM on them to get masks.
  • Classification data pipeline: SAM -> crop -> filter by CLIP score -> caption with BLIP and attach the class label as well.
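The classification-data pipeline can be sketched as glue code like the following. The SAM/CLIP/BLIP wrappers here are toy stubs with hypothetical interfaces (the paper's actual pipeline code is not shown here); only the filtering/pairing logic is the point:

```python
import numpy as np

# Toy stand-ins for the real models; the actual pipeline uses SAM, CLIP,
# and BLIP (these stub signatures are my guess, not the paper's code).
def sam_propose_masks(image):            # hypothetical SAM wrapper
    h, w, _ = image.shape
    m = np.zeros((h, w), dtype=bool)
    m[: h // 2, : w // 2] = True
    return [m]

def clip_score(crop, label):             # hypothetical CLIP image-text score
    return 0.9 if crop.size > 0 else 0.0

def blip_caption(crop):                  # hypothetical BLIP captioner
    return "a photo of the region"

def build_rgba_pairs(image, class_label, threshold=0.5):
    """Classification-data pipeline sketch:
    SAM -> crop -> CLIP-score filter -> BLIP caption (+ class label),
    each caption paired with its region mask."""
    pairs = []
    for mask in sam_propose_masks(image):
        ys, xs = np.where(mask)
        crop = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        if clip_score(crop, class_label) < threshold:
            continue                     # drop regions CLIP cannot match
        text = f"{blip_caption(crop)}, {class_label}"
        pairs.append({"mask": mask, "text": text})
    return pairs
```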

Alpha-CLIP

image
  • text encoder is frozen
  • add RGB conv + alpha conv
  • alpha is between 0 and 1, and we want its contribution to start at 0, so the alpha conv kernel is zero-initialized (the model starts out identical to pretrained CLIP)
  • the alpha conv and RGB conv outputs are combined by elementwise summation (not shown in the figure)
  • 10% of samples in each epoch are trained as plain whole-image text pairs (full alpha)
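A minimal numpy sketch of the patch embedding described above, under my reading of the figure (a stride=patch conv over non-overlapping patches is just a linear projection of each flattened patch, so it is written that way here; all names are mine):

```python
import numpy as np

def patch_embed(x, w_rgb, w_alpha, patch=4):
    """Alpha-CLIP-style patch embedding: a parallel alpha conv whose
    output is summed elementwise with the RGB conv output.

    x: (H, W, 4) RGBA input
    w_rgb: (patch*patch*3, D) RGB projection, w_alpha: (patch*patch, D).
    """
    h, w, _ = x.shape
    tokens = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            p = x[i:i + patch, j:j + patch]
            rgb = p[..., :3].reshape(-1) @ w_rgb     # RGB conv branch
            alpha = p[..., 3].reshape(-1) @ w_alpha  # alpha conv branch
            tokens.append(rgb + alpha)               # elementwise sum
    return np.stack(tokens)                          # (num_patches, D)

rng = np.random.default_rng(0)
x = rng.random((8, 8, 4))
w_rgb = rng.random((4 * 4 * 3, 16))
w_alpha = np.zeros((4 * 4, 16))  # zero-init: alpha contributes nothing at first
```

With the zero-initialized alpha weights, the output is exactly the RGB-only embedding, which is what makes the model start out as plain CLIP.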

Result

  • ImageNet classification image

ImageNet-S provides semantic segmentation annotations on top of ImageNet; feeding that GT mask to the alpha channel improves performance. Performance also improves when the region is given as a bbox, and using the whole image as the mask causes no performance drop (as if the model simply learned from new data).
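The three prompting modes above (GT mask, bbox, whole image) all reduce to building an alpha map in [0, 1]; a trivial sketch with names of my own choosing:

```python
import numpy as np

def alpha_from_mask(mask):
    """GT segmentation mask -> alpha map in [0, 1]."""
    return mask.astype(np.float32)

def alpha_from_bbox(h, w, x0, y0, x1, y1):
    """Box prompt -> alpha map: 1 inside the box, 0 outside."""
    a = np.zeros((h, w), dtype=np.float32)
    a[y0:y1, x0:x1] = 1.0
    return a

def alpha_whole_image(h, w):
    """All-ones alpha: Alpha-CLIP should behave like plain CLIP."""
    return np.ones((h, w), dtype=np.float32)
```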

  • REC image
image

The pipeline is kinda weird, lol… SAM proposes a bunch of masks, the one closest to the text is picked, and that's the answer.
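That REC pipeline is just an argmax over mask scores; a sketch with a stubbed scorer (the real one would be Alpha-CLIP similarity, whose interface I'm guessing at here):

```python
import numpy as np

def pick_referred_mask(masks, text, score_fn):
    """REC pipeline as described: SAM proposes candidate masks, each is
    scored against the referring expression, and the highest-scoring
    mask is returned as the answer."""
    scores = [score_fn(m, text) for m in masks]
    return int(np.argmax(scores)), scores

# Toy scorer for illustration: pretend the expression refers to the
# largest region (the real score_fn would be Alpha-CLIP similarity).
def toy_score(mask, text):
    return float(mask.sum())
```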

  • OVD image

Using a pseudo-labeling approach with Alpha-CLIP as the backbone yields better performance.

  • Region-level captioning: just swapping out the backbone works. image

Quantitative results after finetuning are as follows. image