
paper, code

TL;DR

  • I read this because : aka NoC. It seemed to have done a good job analyzing CLIP scores.
  • task : image captioning with noisy image-text pairs
  • problem : human-annotated data like COCO and Visual Genome are not scalable, but web-crawled pairs are noisy, and filtering them by CLIP score throws away a lot of data
  • idea : bucketize each pair's CLIP score into bins and embed the bin as a control token during captioning, then condition on the best-aligned bin at inference (see the sketch after this list)
  • input/output : image + CLIP score of the {image, text} pair -> text
  • architecture : CLIP ViT-L/14 + 6-layer transformer (94.5M params)
  • objective : cross-entropy loss
  • baselines : no filtering, CLIP-score filtering (threshold 0.3), loss reweighting (loss multiplied by CLIP score), ZeroCap, Socratic Models, DeCap
  • data : CC3M (chosen for sitting on the noisy end!), plus COYO in an ablation
  • evaluation : BLEU, METEOR, CIDEr, SPICE, CLIPScore on COCO and nocaps // self-retrieval R@1 (given a caption generated for a specific image, does that image come back first when retrieving with the caption? sketch below)
  • result : SOTA on everything except BLEU
  • contribution : Simple and intuitive~.
  • etc. : I didn't find exactly what I was looking for, but I enjoyed reading it~ The most similar work is BLIP; come to think of it, BLIP really does seem to be the pioneering study here
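
The idea in code — a minimal sketch, assuming CLIP scores are bucketized into a fixed number of bins and each bin gets a learned embedding prepended to the caption decoder as a control token. All names here (NUM_BINS, score_to_bin, ScoreConditionedDecoder) are mine, not the paper's.

```python
import torch
import torch.nn as nn

NUM_BINS = 10  # assumption: a small fixed number of alignment bins

def score_to_bin(clip_score: torch.Tensor) -> torch.Tensor:
    """Map CLIP scores in [0, 1] to integer bin indices in [0, NUM_BINS - 1]."""
    return (clip_score.clamp(0.0, 1.0) * NUM_BINS).long().clamp(max=NUM_BINS - 1)

class ScoreConditionedDecoder(nn.Module):
    """Caption decoder conditioned on the (image, text) pair's CLIP-score bin."""
    def __init__(self, vocab_size: int, d_model: int = 512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.bin_emb = nn.Embedding(NUM_BINS, d_model)   # one embedding per alignment bin
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)  # 6 layers, per the note
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, image_feats, clip_score):
        # Prepend the bin embedding as a control token, so the decoder is told
        # how well-aligned the noisy training pair is instead of discarding it.
        bin_tok = self.bin_emb(score_to_bin(clip_score)).unsqueeze(1)   # (B, 1, D)
        x = torch.cat([bin_tok, self.token_emb(tokens)], dim=1)        # (B, T+1, D)
        h = self.decoder(x, memory=image_feats)                        # image_feats: (B, S, D)
        return self.lm_head(h)

# At inference, ask for a well-aligned caption by conditioning on the top bin:
#   logits = model(tokens, image_feats, clip_score=torch.full((batch,), 0.99))
```

That's the whole trick: instead of dropping low-CLIP-score pairs, the model learns what each alignment level looks like, and at test time you simply request the highest one.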

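And self-retrieval R@1, as I understand it, could be computed like this — a hypothetical sketch, not the paper's evaluation code: embed each image and its generated caption with CLIP, then check whether each caption ranks its own source image first.

```python
import torch

def self_retrieval_r_at_1(image_embs: torch.Tensor, caption_embs: torch.Tensor) -> float:
    """image_embs, caption_embs: (N, D) L2-normalized CLIP embeddings; row i of each is a pair."""
    sims = caption_embs @ image_embs.T                    # (N, N) caption-to-image similarity
    top1 = sims.argmax(dim=1)                             # best-matching image per caption
    hits = (top1 == torch.arange(sims.size(0))).float()   # did the caption retrieve its own image?
    return hits.mean().item()
```
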
Details

  • motivation (figures)

  • architecture (figure)

  • results (figure)

  • ablations (figure)

  • qualitative (figure)