
paper, code
TL;DR#
- I read this because : it was mentioned in a conversation (https://github.com/long8v/PTIR/issues/139)
- task : LVLM
- problem : turn a LIMBeR-like model into one that can output images, i.e. perform retrieval over interleaved image-text data
- idea : LIMBeR, but append a [RET] token at the end so the output can be used for retrieval.
- input/output : image + text (concatenated at random with 50% probability) + image + text -> free-form text
- architecture : CLIP ViT-L/14 + OPT (6.7B); train only the [RET] token and a linear layer connecting the vision output (5.5M trainable parameters).
- objective : captioning loss + retrieval loss
- baseline : CLIP ViT-L/14, BLIP, Flamingo, ViLBERT, ESPER
- data : (train) CC3M -> (eval) Visual Dialog, Visual Storytelling
- evaluation : IT2T (image/text-to-text) and T2I (text-to-image) R@k, NDCG, MRR; human evaluation of story generation
- result : single-caption retrieval performs worse than CLIP, but retrieval conditioned on interleaved image-text context does better
- contribution : adds new capabilities with minimal training; no need to train on interleaved data the way Flamingo does!
- etc. : the CLIP text encoder is bidirectional; what does that mean here? Since the model starts from CLIP, beating CLIP doesn't seem all that impressive.
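The retrieval half of the objective above can be sketched as a symmetric contrastive (InfoNCE-style) loss between the projected [RET] hidden state and the CLIP image embedding. A minimal NumPy sketch, assuming both sides have already been projected into a shared space; the `temperature` value and dimensions are illustrative, not from the paper:

```python
import numpy as np

def infonce_retrieval_loss(ret_emb, img_emb, temperature=0.07):
    """Symmetric InfoNCE loss between [RET] text embeddings and image embeddings.

    ret_emb, img_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    # L2-normalize so the dot product is cosine similarity
    ret = ret_emb / np.linalg.norm(ret_emb, axis=1, keepdims=True)
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    logits = ret @ img.T / temperature          # (batch, batch) similarity

    def xent(l):
        # cross-entropy with the diagonal (matched pair) as the target
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # average the text->image and image->text directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

In training this term is simply added to the usual autoregressive captioning loss; only the [RET] embedding and the linear projections receive gradients.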
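The R@k number in the evaluation row measures how often the correct item appears among the top-k retrieved candidates. A generic sketch (not the paper's evaluation code), assuming a square similarity matrix where the matched item for query i sits at index i:

```python
import numpy as np

def recall_at_k(sim, k):
    """Fraction of queries whose matched item (same index) ranks in the top-k.

    sim: (num_queries, num_candidates) similarity matrix where sim[i, i]
         is the score of the correct candidate for query i.
    """
    # indices of the k highest-scoring candidates per query
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = [i in topk[i] for i in range(sim.shape[0])]
    return float(np.mean(hits))
```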
Details#

- result

