TL;DR
- why I read this : aka NoC. It seemed to do a careful job analyzing CLIP scores.
- task : image captioning with noisy image-text pairs
- problem : Curated data like COCO and Visual Genome are not scalable. Web-crawled pairs scale, but they are noisy, and filtering them by CLIP score discards a lot of data.
- idea : bin the CLIP scores, embed the bin id, and feed it to the captioner as a conditioning signal during training; at inference, condition on the best-aligned bin.
- input/output : image + CLIP score of the {image, text} pair -> text
- architecture : CLIP ViT-L/14 + 6-layer transformer (94.5M params)
- objective : cross-entropy loss
- baselines : no filtering, filtering (CLIP score >= 0.3), loss reweighting (loss multiplied by CLIP score), ZeroCap, Socratic Models, DeCap
- data : CC3M (chosen for the noisy axis!), plus a COYO ablation
- evaluation : BLEU, METEOR, CIDEr, SPICE, CLIPScore on COCO and nocaps // self-retrieval R@1 (whether the source image is retrieved when querying with the caption generated for it)
- result : SOTA on everything except BLEU
- contribution : Simple and intuitive~.
- etc. : I didn’t get exactly what I was looking for, but I enjoyed reading it~ The most similar work is BLIP; thinking about it, BLIP really seems to be the pioneering study here.
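The core idea in the TL;DR above — bucketizing CLIP scores and conditioning the captioner on the bin — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the bin count and the score range are my assumptions, and in practice the bin id would be looked up in an embedding table and prepended to the decoder input.

```python
def clip_score_to_bin(score: float, num_bins: int = 10,
                      lo: float = 0.0, hi: float = 0.5) -> int:
    """Map a CLIP similarity score to a discrete bin id.

    The score is clamped to [lo, hi) and split into `num_bins`
    equal-width buckets. During training, each pair is conditioned on
    its actual bin; at inference, conditioning on the top bin asks the
    model for a well-aligned caption.
    """
    score = min(max(score, lo), hi - 1e-9)  # clamp into [lo, hi)
    return int((score - lo) / (hi - lo) * num_bins)

# Inference-time control: request the best-aligned bin.
best_bin = clip_score_to_bin(0.49)  # top bucket with these assumed settings
```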
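The self-retrieval R@1 metric from the evaluation bullet can also be sketched: generate a caption per image, embed both sides (e.g. with CLIP), and count how often a caption's nearest image is its own source image. The function below assumes row i of each matrix corresponds to the same image; the embeddings themselves are whatever encoder you choose.

```python
import numpy as np

def self_retrieval_r_at_1(caption_embs: np.ndarray,
                          image_embs: np.ndarray) -> float:
    """Fraction of captions whose nearest image (cosine similarity)
    is the image they were generated for (row-aligned inputs)."""
    # L2-normalize so the dot product equals cosine similarity
    c = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = c @ v.T                      # (num captions) x (num images)
    top1 = sims.argmax(axis=1)          # best-matching image per caption
    return float((top1 == np.arange(len(c))).mean())
```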
Details
- motivation
- architecture
- results
- ablations
- qualitative