
paper, code
TL;DR#
- I read this because : it was mentioned in a conversation (https://github.com/long8v/PTIR/issues/139)
- task : LVLM
- problem : turn a LIMBeR-like model into one that can output images, i.e. perform retrieval over interleaved image-text data
- idea : LIMBeR, but append a [RET] token at the end so the output can be used for retrieval.
- input/output : image + text (concatenated at random with 50% probability) + image + text -> free-form text
- architecture : CLIP ViT-L/14 + OPT (6.7B); train only the [RET] token and a linear layer connecting the vision output (5.5M trainable parameters).
- objective : captioning loss + retrieval loss
- baseline : CLIP ViT-L/14, BLIP, Flamingo, ViLBERT, ESPER
- data : (train) CC3M -> (eval) Visual Dialog, Visual Storytelling
- evaluation : IT2T (image/text-to-text) and T2I (text-to-image) R@k, NDCG, MRR; human evaluation of story generation
- result : single-caption retrieval performs worse than CLIP, but retrieval conditioned on interleaved image-text context does better
- contribution : adds new capabilities with minimal training; no need to train on interleaved data the way Flamingo does!
- etc. : the CLIP text encoder is bidirectional; what does that mean here? Since the model starts from CLIP, beating CLIP doesn't seem all that impressive.
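The retrieval half of the objective above can be sketched as a symmetric contrastive (InfoNCE-style) loss between the projected [RET] hidden state and the CLIP image embedding. A minimal NumPy sketch, assuming both sides have already been projected into a shared space; the `temperature` value and dimensions are illustrative, not from the paper:

```python
import numpy as np

def infonce_retrieval_loss(ret_emb, img_emb, temperature=0.07):
    """Symmetric InfoNCE loss between [RET] text embeddings and image embeddings.

    ret_emb, img_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    # L2-normalize so the dot product is cosine similarity
    ret = ret_emb / np.linalg.norm(ret_emb, axis=1, keepdims=True)
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    logits = ret @ img.T / temperature          # (batch, batch) similarity

    def xent(l):
        # cross-entropy with the diagonal (matched pair) as the target
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # average the text->image and image->text directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

In training this term is simply added to the usual autoregressive captioning loss; only the [RET] embedding and the linear projections receive gradients.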
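The R@k number in the evaluation row measures how often the correct item appears among the top-k retrieved candidates. A generic sketch (not the paper's evaluation code), assuming a square similarity matrix where the matched item for query i sits at index i:

```python
import numpy as np

def recall_at_k(sim, k):
    """Fraction of queries whose matched item (same index) ranks in the top-k.

    sim: (num_queries, num_candidates) similarity matrix where sim[i, i]
         is the score of the correct candidate for query i.
    """
    # indices of the k highest-scoring candidates per query
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = [i in topk[i] for i in range(sim.shape[0])]
    return float(np.mean(hits))
```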
Details#

- result

