TL;DR
- I read this because.. : I read papers for a living…
- task : MLLM
- problem : a multi-lingual MLLM that also covers Chinese; they also take on a fine-grained task (grounding)
- IDEA: Divide the training into three phases.
- input/output : image, text -> text
- architecture : ViT-G/14 + Q-former + Qwen-7B
- objective : CE loss
- baseline : Flamingo, UnifiedIO, Kosmos, BLIP-2, InstructBLIP, Shikra, Pix2Struct, …
- data : captioning(LAION-en/zh, Datacomp, COYO, CC, SBU, COCO, in-house data), VQA(GQA, VGQA, VQAv2, DVQA, OCR-VQA, DocVQA, TextVQA, ChartQA, AI2D), Grounding(GRIT, VG, RefCOCO(+, g), OCR(synthDoG, Common Crawl…)), Pure-text (in-house)
- evaluation : benchmarks, instruction-following benchmarks(TouchStone, SEED, MME)
- result : sota
- contribution : multi-lingual lvlm
- etc. : Is the filtering strategy the important part? They also used text-only data, but it's unclear how much that mattered by the end of training. Or maybe it did contribute to the better performance… hard to say, since many of these choices aren't ablated.
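The input/output and objective above boil down to next-token cross-entropy on the text, with image positions excluded from the loss. A toy numpy sketch of that masking (all names and dims are mine, not the paper's):

```python
import numpy as np

def cross_entropy_text_only(logits, targets, loss_mask):
    """CE averaged over text tokens only; image-placeholder positions
    are masked out of the loss (illustrative sketch, not paper code)."""
    # log-softmax over the vocab dimension
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return (nll * loss_mask).sum() / loss_mask.sum()

# toy sequence: 2 image positions (masked out) + 3 text positions
logits = np.zeros((5, 10))               # uniform logits, vocab of 10
targets = np.array([0, 0, 3, 1, 4])
mask = np.array([0, 0, 1, 1, 1], dtype=float)
print(round(cross_entropy_text_only(logits, targets, mask), 4))  # -> 2.3026 (= ln 10)
```

With uniform logits the per-token loss is exactly ln(vocab size), which is a handy sanity check for this kind of masked CE.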
Details
- performance
architecture
among the numbers of learnable queries tried for the adapter, 256 was best
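As I understand it, the adapter compresses the ViT's patch features down to a fixed set of learnable queries via cross-attention, and 256 queries was the sweet spot. A minimal single-head numpy sketch (dims and names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                               # hidden dim (illustrative)
n_patches, n_queries = 1024, 256     # 256 queries per the paper

patch_feats = rng.normal(size=(n_patches, d))   # stand-in for ViT outputs
queries = rng.normal(size=(n_queries, d))       # learnable query embeddings

# single-head cross-attention: queries attend over all patch features
scores = queries @ patch_feats.T / np.sqrt(d)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
compressed = attn @ patch_feats      # (256, d): what the LLM actually sees

print(compressed.shape)  # -> (256, 64)
```

The point of the compression is a fixed visual token budget regardless of input resolution, which is what makes the later resolution bump cheap on the LLM side.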
Inputs / Outputs
Quite a lot of extra instruction text gets attached to the inputs.
Special tokens like <ref> and <box> mark referred expressions and boxes, but the bbox coordinates themselves are written as plain text, with no dedicated coordinate tokens.
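So a grounding target is just a serialized string. A sketch of how I read the format (the [0, 1000) coordinate normalization is from my recollection of the paper; the function and names are mine):

```python
def to_grounding_text(phrase, bbox, img_w, img_h):
    """Serialize a grounded phrase: <ref>/<box> are special tokens, but
    the coordinates are emitted as plain text normalized to [0, 1000)
    (my sketch of the scheme, not the paper's code)."""
    x1, y1, x2, y2 = bbox
    norm = lambda v, side: int(v / side * 1000)
    coords = (f"({norm(x1, img_w)},{norm(y1, img_h)}),"
              f"({norm(x2, img_w)},{norm(y2, img_h)})")
    return f"<ref>{phrase}</ref><box>{coords}</box>"

print(to_grounding_text("a dog", (50, 40, 250, 200), 500, 400))
# -> <ref>a dog</ref><box>(100,100),(500,500)</box>
```

Since the coordinates are ordinary text tokens, the LLM learns them with the same CE objective as everything else, with no extra detection head.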
training pipeline
- hparams change across stages
image resolution goes up / sequence length goes up
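One way to picture the three-phase schedule is a per-stage config where resolution and sequence length only grow. The specific numbers below are from my recollection of the paper and are not verified here; only the monotone-increase pattern is from the notes:

```python
# Hypothetical per-stage hyperparameters (values illustrative).
STAGES = [
    ("pretrain",  {"image_res": 224, "seq_len": 512}),
    ("multitask", {"image_res": 448, "seq_len": 2048}),
    ("sft",       {"image_res": 448, "seq_len": 2048}),
]

# sanity check: resolution and seq len never decrease across stages
res = [cfg["image_res"] for _, cfg in STAGES]
seq = [cfg["seq_len"] for _, cfg in STAGES]
assert res == sorted(res) and seq == sorted(seq)
```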
- Datasets that vary by stage
pre-training stage
Interesting that COYO has the highest survival rate among the alt-text datasets. (I only skimmed this part once lol.) The filtering rules are documented in detail in the appendix; roughly:
Only pairs with a fairly high CLIP score are kept.
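The CLIP-score filter itself is conceptually just a threshold over precomputed image-text similarities. A toy sketch (the threshold and scores below are made up for illustration; the actual cutoffs are in the appendix):

```python
# Hypothetical alt-text samples with precomputed CLIP similarity scores.
samples = [
    {"url": "a.jpg", "alt": "a red bicycle",       "clip_score": 0.34},
    {"url": "b.jpg", "alt": "click to enlarge",    "clip_score": 0.12},
    {"url": "c.jpg", "alt": "sunset over the bay", "clip_score": 0.29},
]

THRESHOLD = 0.28   # made-up cutoff, not the paper's
kept = [s for s in samples if s["clip_score"] >= THRESHOLD]
survival_rate = len(kept) / len(samples)
print(len(kept), round(survival_rate, 3))  # -> 2 0.667
```

The "survival rate" per source is exactly this ratio, which is why a noisy-alt-text corpus with unusually well-aligned captions (like COYO here) can stand out.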
Multi-task Pre-training
Supervised Finetuning
This part is also light on detail, but they mention manual annotation, model-generated data, concatenating benchmark datasets, and building multi-turn dialogues (which I think is the important part…)
Result
Benchmark performance is omitted
instruction following benchmark
Few-shot ability
text only benchmark
The LLM is initialized from an intermediate Qwen checkpoint, apparently for no reason other than that the two were being developed around the same time lol.