
paper, code

TL;DR

  • I read this because : https://github.com/long8v/PTIR/issues/139 was mentioned in a conversation
  • task : LVLM
  • problem : extend a LIMBeR-like model so it can output images -> a model that can do retrieval over interleaved image-text data
  • idea : LIMBeR, but append a [RET] token at the end of the text so the model becomes retrieval-capable.
  • input/output : image + text (a second pair concatenated at random with 50% probability) + image + text -> free-form text
  • architecture : CLIP ViT-L/14 + OPT (6.7B); train only the [RET] token and the linear layers connecting the vision output (5.5M trainable parameters).
  • objective : captioning loss + retrieval loss
  • baseline : CLIP ViT-L/14, BLIP, Flamingo, ViLBERT, ESPER
  • data : (train) CC3M -> (eval) Visual Dialog, Visual Storytelling
  • evaluation : R@k for IT2T (image/text-to-text) and T2I (text-to-image), NDCG, MRR, human evaluation of story generation
  • result : retrieval from a single caption performs worse than CLIP, but the model is stronger when the context is interleaved image-text
  • contribution : adds new capabilities with minimal training, and no need to train on interleaved data like Flamingo!
  • etc. : the CLIP text encoder is bidirectional; what does that imply here? Also, since the model starts from CLIP, beating CLIP doesn't seem that impressive.
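The architecture bullet above (frozen CLIP + frozen OPT, with only the [RET] token and linear maps trained) can be sketched roughly as follows. This is a minimal numpy sketch, not the paper's implementation: the joint-space dimension, the names `W_text`/`W_img`, and the random vectors standing in for the frozen encoders are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D_LM, D_IMG, D_JOINT = 4096, 768, 256  # OPT-6.7B hidden size, CLIP ViT-L/14 feature size, assumed joint dim

# Only these linear maps (and the [RET] embedding) are trained; CLIP and OPT stay frozen.
W_text = rng.normal(scale=0.02, size=(D_LM, D_JOINT))   # projects the [RET] hidden state
W_img = rng.normal(scale=0.02, size=(D_IMG, D_JOINT))   # projects CLIP image features

def normalize(v):
    return v / np.linalg.norm(v)

def text_query(ret_hidden):
    """Project the LM's hidden state at the [RET] position into the joint space."""
    return normalize(ret_hidden @ W_text)

def image_key(clip_feat):
    """Project a frozen CLIP image embedding into the joint space."""
    return normalize(clip_feat @ W_img)

# Retrieval: rank candidate images by cosine similarity to the [RET] query.
ret_hidden = rng.normal(size=D_LM)            # stand-in for the frozen LM output
candidates = rng.normal(size=(5, D_IMG))      # stand-ins for frozen CLIP features
scores = np.stack([image_key(c) for c in candidates]) @ text_query(ret_hidden)
best = int(np.argmax(scores))
```

Because both sides are L2-normalized, the dot products are cosine similarities, so retrieval reduces to a max over `scores`.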

Details

  • result figures (images omitted)
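The objective in the TL;DR combines a captioning loss with a retrieval loss. As a hedged sketch, the retrieval side can be written as a symmetric InfoNCE contrastive loss between [RET] text queries and image embeddings within a batch; the temperature value and exact form below are illustrative assumptions, not necessarily the paper's formulation.

```python
import numpy as np

def info_nce(text_emb, img_emb, tau=0.07):
    """Symmetric contrastive loss: matched (i, i) text/image pairs are positives.

    tau is an assumed temperature; the paper's value may differ.
    """
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    logits = t @ v.T / tau                      # cosine similarities, sharpened
    n = logits.shape[0]
    log_p_rows = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_p_cols = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    diag = np.arange(n)
    loss_t2i = -log_p_rows[diag, diag].mean()   # text -> image direction
    loss_i2t = -log_p_cols[diag, diag].mean()   # image -> text direction
    return 0.5 * (loss_t2i + loss_i2t)

# total objective, schematically:
# loss = caption_loss + info_nce(ret_queries, image_feats)
```

With perfectly aligned embeddings the loss approaches zero, while mismatched batches yield a positive loss, which is what pushes the [RET] representation toward the paired image.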