TL;DR
- I read this because.. : I read papers for a living…
- task : MLLM
- problem : a multi-lingual MLLM that also covers Chinese; they also take on a fine-grained task (grounding)
- IDEA: Divide the training into three phases.
- input/output : image, text -> text
- architecture : ViT-G/14 + Q-former + Qwen-7B
- objective : CE loss
- baseline : Flamingo, UnifiedIO, Kosmos, BLIP-2, InstructBLIP, Shikra, Pix2Struct, …
- data : captioning(LAION-en/zh, Datacomp, COYO, CC, SBU, COCO, in-house data), VQA(GQA, VGQA, VQAv2, DVQA, OCR-VQA, DocVQA, TextVQA, ChartQA, AI2D), Grounding(GRIT, VG, RefCOCO(+, g), OCR(synthDoG, Common Crawl…)), Pure-text (in-house)
- evaluation : benchmarks, instruction-following benchmarks(TouchStone, SEED, MME)
- result : sota
- contribution : multi-lingual lvlm
- etc. : Is the filtering strategy the important part? They also used text-only data, but it's unclear how much that mattered by the end of training. Or maybe it did contribute to the better performance… hard to say, since many of these choices aren't ablated.
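The input/output and objective above boil down to next-token cross-entropy on the text, with image positions excluded from the loss. A toy numpy sketch of that masking (all names and dims are mine, not the paper's):

```python
import numpy as np

def cross_entropy_text_only(logits, targets, loss_mask):
    """CE averaged over text tokens only; image-placeholder positions
    are masked out of the loss (illustrative sketch, not paper code)."""
    # log-softmax over the vocab dimension
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return (nll * loss_mask).sum() / loss_mask.sum()

# toy sequence: 2 image positions (masked out) + 3 text positions
logits = np.zeros((5, 10))               # uniform logits, vocab of 10
targets = np.array([0, 0, 3, 1, 4])
mask = np.array([0, 0, 1, 1, 1], dtype=float)
print(round(cross_entropy_text_only(logits, targets, mask), 4))  # -> 2.3026 (= ln 10)
```

With uniform logits the per-token loss is exactly ln(vocab size), which is a handy sanity check for this kind of masked CE.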
Details
- performance
architecture
among the numbers of learnable queries tried for the adapter, 256 was best
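As I understand it, the adapter compresses the ViT's patch features down to a fixed set of learnable queries via cross-attention, and 256 queries was the sweet spot. A minimal single-head numpy sketch (dims and names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                               # hidden dim (illustrative)
n_patches, n_queries = 1024, 256     # 256 queries per the paper

patch_feats = rng.normal(size=(n_patches, d))   # stand-in for ViT outputs
queries = rng.normal(size=(n_queries, d))       # learnable query embeddings

# single-head cross-attention: queries attend over all patch features
scores = queries @ patch_feats.T / np.sqrt(d)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
compressed = attn @ patch_feats      # (256, d): what the LLM actually sees

print(compressed.shape)  # -> (256, 64)
```

The point of the compression is a fixed visual token budget regardless of input resolution, which is what makes the later resolution bump cheap on the LLM side.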
Inputs / Outputs
Quite a lot of extra instruction text gets attached to the inputs.
Special tokens like <ref> and <box> mark referred expressions and boxes, but the bbox coordinates themselves are written as plain text, with no dedicated coordinate tokens.
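So a grounding target is just a serialized string. A sketch of how I read the format (the [0, 1000) coordinate normalization is from my recollection of the paper; the function and names are mine):

```python
def to_grounding_text(phrase, bbox, img_w, img_h):
    """Serialize a grounded phrase: <ref>/<box> are special tokens, but
    the coordinates are emitted as plain text normalized to [0, 1000)
    (my sketch of the scheme, not the paper's code)."""
    x1, y1, x2, y2 = bbox
    norm = lambda v, side: int(v / side * 1000)
    coords = (f"({norm(x1, img_w)},{norm(y1, img_h)}),"
              f"({norm(x2, img_w)},{norm(y2, img_h)})")
    return f"<ref>{phrase}</ref><box>{coords}</box>"

print(to_grounding_text("a dog", (50, 40, 250, 200), 500, 400))
# -> <ref>a dog</ref><box>(100,100),(500,500)</box>
```

Since the coordinates are ordinary text tokens, the LLM learns them with the same CE objective as everything else, with no extra detection head.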
training pipeline
- hparams change across stages
image resolution goes up / sequence length goes up
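One way to picture the three-phase schedule is a per-stage config where resolution and sequence length only grow. The specific numbers below are from my recollection of the paper and are not verified here; only the monotone-increase pattern is from the notes:

```python
# Hypothetical per-stage hyperparameters (values illustrative).
STAGES = [
    ("pretrain",  {"image_res": 224, "seq_len": 512}),
    ("multitask", {"image_res": 448, "seq_len": 2048}),
    ("sft",       {"image_res": 448, "seq_len": 2048}),
]

# sanity check: resolution and seq len never decrease across stages
res = [cfg["image_res"] for _, cfg in STAGES]
seq = [cfg["seq_len"] for _, cfg in STAGES]
assert res == sorted(res) and seq == sorted(seq)
```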
- Datasets that vary by stage
pre-training stage
Interesting that COYO has the highest survival rate among the alt-text datasets. (I only skimmed this part once lol.) The filtering rules are documented in detail in the appendix; roughly:
Only pairs with a fairly high CLIP score are kept.
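The CLIP-score filter itself is conceptually just a threshold over precomputed image-text similarities. A toy sketch (the threshold and scores below are made up for illustration; the actual cutoffs are in the appendix):

```python
# Hypothetical alt-text samples with precomputed CLIP similarity scores.
samples = [
    {"url": "a.jpg", "alt": "a red bicycle",       "clip_score": 0.34},
    {"url": "b.jpg", "alt": "click to enlarge",    "clip_score": 0.12},
    {"url": "c.jpg", "alt": "sunset over the bay", "clip_score": 0.29},
]

THRESHOLD = 0.28   # made-up cutoff, not the paper's
kept = [s for s in samples if s["clip_score"] >= THRESHOLD]
survival_rate = len(kept) / len(samples)
print(len(kept), round(survival_rate, 3))  # -> 2 0.667
```

The "survival rate" per source is exactly this ratio, which is why a noisy-alt-text corpus with unusually well-aligned captions (like COYO here) can stand out.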
Multi-task Pre-training
Supervised Finetuning
This part is also light on detail, but they mention manual annotation, model-generated data, concatenating benchmark datasets, and building multi-turn dialogues (which I think is the important part…)
Result
Benchmark performance is omitted
instruction following benchmark
Few-shot ability
text only benchmark
The LLM is initialized from an intermediate Qwen checkpoint, apparently for no reason other than that the two were being developed around the same time lol.