[129] Grounding Language Models to Images for Multimodal Inputs and Outputs

TL;DR

I read this because.. : it came up while discussing https://github.com/long8v/PTIR/issues/139
task : LVLM
problem : a LIMBeR-style model that can also output images -> a model that can do retrieval over interleaved image-text
idea : LIMBeR, but with a [RET] token appended at the end so that retrieval becomes possible.
input/output : image + text (randomly concatenated with 50% probability) + image + text -> free-form text
architecture : CLIP ViT-L/14 + OPT (6.7B); only the linear function that connects the vision output to the LM and the [RET] token are trained (5.5M trainable parameters).
objective : captioning loss + retrieval loss
baseline : CLIP ViT-L/14, BLIP, Flamingo, ViLBERT, ESPER
data : (train) CC3M -> (eval) Visual Dialog (VisDial), Visual Storytelling (VIST)
evaluation : IT2T (image/text-to-text, text-to-image) R@k, NDCG, MRR, human evaluation of story generation
result : on single retrieval the performance is lower than CLIP, but image - text
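contribution : adds a new capability with minimal training, without the separate interleaved-data training that Flamingo requires!
etc. : Not sure what it means that the CLIP text encoder is bidirectional.. Since the model starts from CLIP, beating CLIP may not be that impressive either..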
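
To make the retrieval part of the objective and the R@k metric concrete, here is a minimal sketch in plain Python. It assumes the retrieval loss is a symmetric contrastive (InfoNCE-style) loss between the [RET] embedding of each caption and the embedding of its paired image; the notes above only say "retrieval loss", so that choice, the function names, and the temperature value are all illustrative assumptions rather than the paper's exact recipe.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(ret_embs, image_embs):
    """sim[i][j] = cosine similarity between caption i's [RET] embedding and image j."""
    t = [l2_normalize(v) for v in ret_embs]
    im = [l2_normalize(v) for v in image_embs]
    return [[sum(a * b for a, b in zip(tv, iv)) for iv in im] for tv in t]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for idx, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[idx]
    return total / len(logits)


def retrieval_loss(ret_embs, image_embs, temperature=0.07):
    """Assumed symmetric InfoNCE: text-to-image plus image-to-text cross-entropy.

    temperature=0.07 is a hypothetical default, not a value taken from the notes.
    """
    sim = similarity_matrix(ret_embs, image_embs)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for image-to-text
    return cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t)


def recall_at_k(sim, k):
    """Fraction of queries whose paired item (same index) ranks in the top k."""
    hits = 0
    for idx, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        hits += idx in ranked[:k]
    return hits / len(sim)


# Toy batch: two captions and two images with 2-d embeddings.
ret = [[1.0, 0.0], [0.0, 1.0]]
aligned_images = [[1.0, 0.0], [0.0, 1.0]]
swapped_images = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = retrieval_loss(ret, aligned_images)
loss_swapped = retrieval_loss(ret, swapped_images)
r1_aligned = recall_at_k(similarity_matrix(ret, aligned_images), 1)
r1_swapped = recall_at_k(similarity_matrix(ret, swapped_images), 1)
```

In the toy batch, the correctly paired images give a much smaller loss and an R@1 of 1.0, while the swapped pairing gives a large loss and an R@1 of 0.0. Under the objective line above, this retrieval term would be added to the usual next-token captioning loss.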