[149] Noise-aware Learning from Web-crawled Image-Text Data for Image Captioning

TL;DR

I read this because.. : aka NoC. I expected it to include a good analysis of CLIP score, so I read it.
task : captioning with noisy image-text label
problem : Datasets like COCO and Visual Genome are not scalable. Web-crawled pairs scale, but they can be noisy, and filtering them by CLIP score throws away a large share of the data.
idea : Bin the CLIP score, embed the bin, and feed that embedding to the captioner during training; at inference, condition on the bin for the best-aligned score.
input/output : image, clip score of {image, text} pair -> text
architecture : CLIP ViT-L/14 + 6-layer transformer(94.5M)
objective : cross-entropy loss
baseline : no filtering, filtering (CLIP score threshold 0.3), loss reweighting (multiply the loss by the CLIP score), ZeroCap, Socratic Models, DeCap
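data : CC3M (so it counts as one of the noisy ones!), COYO also tried as an ablation
evaluation : BLEU, METEOR, CIDEr, SPICE, CLIPScore on COCO and nocaps // self-retrieval R@1 (whether retrieving with the caption generated from a given image returns that same image)
result : SOTA on everything except BLEU
contribution : simple and intuitive~
etc. : Didn't get what I was looking for, but it was a fun read~ The most similar work is said to be BLIP, and thinking about it, that's true.. BLIP really seems like pioneering work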
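
A minimal sketch of the idea and the filtering baseline above, to make the difference concrete. The scores, the equal-width binning, the number of bins, and every function name here are assumptions for illustration, not the paper's actual settings. In the real model the bin index would select a learned embedding that is fed to the captioner. Here only the index is computed.

```python
from bisect import bisect_right


def make_bin_edges(scores, num_bins):
    """Interior edges of equal-width bins over the observed score range (assumed scheme)."""
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / num_bins
    return [lo + width * i for i in range(1, num_bins)]


def to_level(score, edges):
    """Map a CLIP score to a discrete alignment level in [0, len(edges)]."""
    return bisect_right(edges, score)


def filter_pairs(scores, threshold=0.3):
    """Filtering baseline: keep only the indices whose score reaches the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]


# Hypothetical CLIP similarities for six web-crawled (image, text) pairs.
scores = [0.12, 0.18, 0.25, 0.31, 0.36, 0.40]
num_bins = 4  # assumed value, chosen only to keep the example small

edges = make_bin_edges(scores, num_bins)

# Training: every pair is kept, each tagged with its alignment level.
train_levels = [to_level(s, edges) for s in scores]

# Inference: no caption exists to score, so condition on the best-aligned level.
inference_level = num_bins - 1

# Filtering baseline for comparison: half of the pairs are discarded here.
kept = filter_pairs(scores, threshold=0.3)

print(train_levels, inference_level, kept)
```

With these toy numbers, the binning approach trains on all six pairs, while the 0.3 threshold keeps only three, which is the data loss described in the problem line.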