TL;DR
- why I read this : aka NoC. It seemed to do a careful job analyzing CLIP scores.
- task : image captioning with noisy image-text pairs
- problem : Curated data like COCO and Visual Genome are not scalable. Web-crawled pairs scale, but they are noisy, and filtering them by CLIP score discards a lot of data.
- idea : bin the CLIP scores, embed the bin id, and feed it to the captioner as a conditioning signal during training; at inference, condition on the best-aligned bin.
- input/output : image + CLIP score of the {image, text} pair -> text
- architecture : CLIP ViT-L/14 + 6-layer transformer (94.5M params)
- objective : cross-entropy loss
- baselines : no filtering, filtering (CLIP score >= 0.3), loss reweighting (loss multiplied by CLIP score), ZeroCap, Socratic Models, DeCap
- data : CC3M (chosen for the noisy axis!), plus a COYO ablation
- evaluation : BLEU, METEOR, CIDEr, SPICE, CLIPScore on COCO and nocaps // self-retrieval R@1 (whether the source image is retrieved when querying with the caption generated for it)
- result : SOTA on everything except BLEU
- contribution : Simple and intuitive~.
- etc. : I didn’t get exactly what I was looking for, but I enjoyed reading it~ The most similar work is BLIP; thinking about it, BLIP really seems to be the pioneering study here.
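The core idea in the TL;DR above — bucketizing CLIP scores and conditioning the captioner on the bin — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the bin count and the score range are my assumptions, and in practice the bin id would be looked up in an embedding table and prepended to the decoder input.

```python
def clip_score_to_bin(score: float, num_bins: int = 10,
                      lo: float = 0.0, hi: float = 0.5) -> int:
    """Map a CLIP similarity score to a discrete bin id.

    The score is clamped to [lo, hi) and split into `num_bins`
    equal-width buckets. During training, each pair is conditioned on
    its actual bin; at inference, conditioning on the top bin asks the
    model for a well-aligned caption.
    """
    score = min(max(score, lo), hi - 1e-9)  # clamp into [lo, hi)
    return int((score - lo) / (hi - lo) * num_bins)

# Inference-time control: request the best-aligned bin.
best_bin = clip_score_to_bin(0.49)  # top bucket with these assumed settings
```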
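The self-retrieval R@1 metric from the evaluation bullet can also be sketched: generate a caption per image, embed both sides (e.g. with CLIP), and count how often a caption's nearest image is its own source image. The function below assumes row i of each matrix corresponds to the same image; the embeddings themselves are whatever encoder you choose.

```python
import numpy as np

def self_retrieval_r_at_1(caption_embs: np.ndarray,
                          image_embs: np.ndarray) -> float:
    """Fraction of captions whose nearest image (cosine similarity)
    is the image they were generated for (row-aligned inputs)."""
    # L2-normalize so the dot product equals cosine similarity
    c = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = c @ v.T                      # (num captions) x (num images)
    top1 = sims.argmax(axis=1)          # best-matching image per caption
    return float((top1 == np.arange(len(c))).mean())
```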
Details
- motivation
- architecture
- results
- ablations
- qualitative