[144] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

TL;DR

I read this because.. : reading papers to pay the bills..
task : MLLM
problem : a multilingual MLLM that also handles Chinese, and that can do fine-grained tasks (grounding) as well
idea : split training into three stages.
input/output : image, text -> text
architecture : ViT-bigG/14 + Q-Former-style adapter (single-layer cross-attention with learnable queries) + Qwen-7B
objective : CE loss
baseline : Flamingo, UnifiedIO, Kosmos, BLIP-2, InstructBLIP, Shikra, Pix2Struct, …
data : captioning(LAION-en/zh, DataComp, COYO, CC, SBU, COCO, in-house data), VQA(GQA, VGQA, VQAv2, DVQA, OCR-VQA, DocVQA, TextVQA, ChartQA, AI2D), Grounding(GRIT, VG, RefCOCO/+/g), OCR(SynthDoG, Common Crawl…), Pure-text (in-house)
evaluation : benchmarks, instruction-following benchmarks(TouchStone, SEED, MME)
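result : SOTA
contribution : multilingual LVLM
etc. : Is the filtering strategy what really matters? They also used text-only data, but they didn't start from a fully trained checkpoint.. or did that actually help performance.. Either way, the ablations aren't thorough, so it's hard to tell.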
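
A rough sketch of the three-stage idea above, written as a freeze schedule. The stage names and which modules are trainable in each stage are my reading of the paper, not something recorded in these notes, and `StageConfig` / `trainable_modules` are hypothetical names made up for illustration. Treat it as a memory aid rather than the authors' training code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Which of the three modules receive gradients in a given stage (assumed)."""
    name: str
    train_vision_encoder: bool
    train_adapter: bool
    train_llm: bool


# Assumed schedule, based on my reading of the paper:
#   stage 1: LLM frozen, vision encoder + adapter trained
#   stage 2: everything trained
#   stage 3: vision encoder frozen, adapter + LLM trained
STAGES = [
    StageConfig("pretraining", True, True, False),
    StageConfig("multitask_pretraining", True, True, True),
    StageConfig("supervised_finetuning", False, True, True),
]


def trainable_modules(stage: StageConfig) -> list[str]:
    """Return the names of the modules that are updated in this stage."""
    flags = {
        "vision_encoder": stage.train_vision_encoder,
        "adapter": stage.train_adapter,
        "llm": stage.train_llm,
    }
    return [module for module, enabled in flags.items() if enabled]


if __name__ == "__main__":
    for stage in STAGES:
        print(f"{stage.name}: train {trainable_modules(stage)}")
```

In a real framework each flag would map to something like toggling gradient updates on that module's parameters; the adapter is the only piece trained in all three stages under this assumed schedule.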