image

paper, code

TL;DR

  • I read this because… : reading papers is my job…
  • task : MLLM
  • problem : a multilingual MLLM that also covers Chinese, plus fine-grained tasks (grounding)
  • IDEA: Divide the training into three phases.
  • input/output : image, text -> text
  • architecture : ViT-G/14 + Q-former + Qwen-7B
  • objective : CE loss
  • baseline : Flamingo, UnifiedIO, Kosmos, BLIP-2, InstructBLIP, Shikra, Pix2Struct, …
  • data : captioning(LAION-en/zh, Datacomp, COYO, CC, SBU, COCO, in-house data), VQA(GQA, VGQA, VQAv2, DVQA, OCR-VQA, DocVQA, TextVQA, ChartQA, AI2D), Grounding(GRIT, VG, RefCOCO(+, g), OCR(synthDoG, Common Crawl…)), Pure-text (in-house)
  • evaluation : benchmarks, instruction-following benchmarks(TouchStone, SEED, MME)
  • result : sota
  • contribution : multi-lingual lvlm
  • etc. : Is the filtering strategy important? They also used pure-text data, but it's not clear whether that changed how training turned out, or whether it actually contributed to the better performance… Hard to tell, since many of these choices aren't well ablated.

Details

  • performance image

architecture

image

256 queries was best image
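To make the "256 queries" point concrete, here is a minimal sketch of a position-aware adapter: a single cross-attention layer that compresses the ViT patch sequence into a fixed number of learnable queries. Layer sizes, class name, and the single-layer design are my assumptions for illustration, not the released code.

```python
import torch
import torch.nn as nn

class VLAdapter(nn.Module):
    """Sketch: compress variable-length ViT features into a fixed set of
    learnable query tokens via one cross-attention layer. 256 queries was
    reportedly the best setting; everything else here is assumed."""
    def __init__(self, num_queries=256, dim=1024, num_heads=8):
        super().__init__()
        # Learnable queries that attend over the image features
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vit_feats):  # vit_feats: (B, num_patches, dim)
        b = vit_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Cross-attention: queries read from the ViT patch sequence
        out, _ = self.attn(q, vit_feats, vit_feats)
        return out  # (B, num_queries, dim) — fixed length for the LLM
```

The point of the fixed query count is that the LLM always sees the same number of visual tokens regardless of input resolution.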

Inputs / Outputs

image

That’s a lot of extra instructions. Special tokens like <ref> and <box> are used, while the bbox coordinates themselves are written as plain text rather than as special tokens.
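A quick sketch of what that grounding format looks like, assuming coordinates normalized to [0, 1000) and the (x1,y1),(x2,y2) text layout; the helper name and exact template are my guesses:

```python
def format_grounding(phrase, box, width, height):
    """Hypothetical helper: wrap a referring phrase in <ref> tags and its
    bounding box in <box> tags, with coordinates normalized to [0, 1000)
    and written as plain text (no special tokens for the numbers)."""
    x1, y1, x2, y2 = box
    norm = lambda v, size: int(v / size * 1000)
    return (f"<ref>{phrase}</ref>"
            f"<box>({norm(x1, width)},{norm(y1, height)}),"
            f"({norm(x2, width)},{norm(y2, height)})</box>")

# e.g. a 500x500 image with a box at (50, 100)-(250, 300)
print(format_grounding("the dog", (50, 100, 250, 300), 500, 500))
# → <ref>the dog</ref><box>(100,200),(500,600)</box>
```

Writing coordinates as plain digits means the LM can reuse its ordinary number tokens instead of learning a separate coordinate vocabulary.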

training pipeline

image
  • Changing hparam image

resolution up / seq len up
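The three-stage schedule above can be summarized as a config sketch. These are rough per-stage settings as I read them from the figure (resolution and which modules stay frozen); the exact values are assumptions, so check the paper before relying on them:

```python
# Rough sketch of the three training stages (assumed values).
STAGES = {
    # Stage 1: caption pre-training — LLM frozen, low resolution
    "pretrain":  {"resolution": 224, "frozen": ["llm"]},
    # Stage 2: multi-task pre-training — everything trainable, resolution up
    "multitask": {"resolution": 448, "frozen": []},
    # Stage 3: SFT — visual encoder frozen, train adapter + LLM
    "sft":       {"resolution": 448, "frozen": ["vit"]},
}
```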

  • Datasets that vary by stage

pre-training stage

image

Interesting that COYO has the highest survival rate among the alt-text datasets. Apparently they only did a single pass over the images lol. The filtering rules are documented in detail in the appendix, as follows:

image

They kept only pairs with a fairly high CLIP score.

Multi-task Pre-training

image

Supervised Finetuning

This part isn’t detailed either, but they say it involves manual annotation, model-generated data, concatenation of benchmark data, and multi-turn dialogues (which I think is important…) image
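Since the post doesn't show the SFT data format, here is a guess at what a multi-turn sample with an image and a grounding turn might look like; the role/content field names and the <img> tag convention are assumptions on my part:

```python
# Hypothetical multi-turn SFT sample (field names and tags assumed).
sample = [
    {"role": "user",
     "content": "Picture 1: <img>path/to/img.jpg</img>\nWhat is in the image?"},
    {"role": "assistant",
     "content": "A dog sitting on a bench."},
    {"role": "user",
     "content": "Where is the dog?"},
    # Grounding answer reuses the <ref>/<box> format from pre-training
    {"role": "assistant",
     "content": "<ref>the dog</ref><box>(120,340),(560,880)</box>"},
]
```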

Result

Benchmark performance is omitted

instruction following benchmark

image

Few-shot ability

image

text only benchmark

image

They used an intermediate checkpoint of Qwen LM, for no reason other than that the two models were being developed at about the same time lol.