image

paper , code

TL;DR

  • I read this because.. : ์ƒ๊ณ„ํ˜• ๋…ผ๋ฌธ ์ฝ๊ธฐ..
  • task : MLLM
  • problem : chinese๋„ ๋˜๋Š” multi-lingual MLLM. finegrained task(grounding)๋„ ํ•˜์ž
  • idea : training ๋‹จ๊ณ„๋ฅผ ์„ธ๊ฐœ๋กœ ๋‚˜๋ˆ ์„œ ํ•™์Šต.
  • input/output : image, text -> text
  • architecture : ViT-G/14 + Q-former + Qwen-7B
  • objective : CE loss
  • baseline : Flamingo, UnifiedIO, Kosmos, BLIP-2, InstrcutBLIP, Shikra, Pix2Struct, …
  • data : captioning(LAION-en/zh, Datacomp, COYO, CC, SBU, COCO, in-house data), VQA(GQA, VGQA, VQAv2, DVQA, OCR-VQA, DocVQA, TextVQA, ChartQA, AI2D), Grounding(GRIT, VG, RefCOCO(+, g), OCR(synthDoG, Common Crawl…)), Pure-text (in-house)
  • evaluation : benchmarks, instruction-following benchmarks(TouchStone, SEED, MME)
  • result : sota
  • contribution : multi-lingual lvlm
  • etc. : filtering ์ „๋žต์ด ์ค‘์š”ํ•œ๊ฑด๊ฐ€? text only data๋„ ์ผ๋Š”๋ฐ ํ•™์Šต์ด ๋‹ค ์™„๋ฃŒ๋œ๊ฑธ ์•ˆ๊ฐ€์ ธ์™€์„œ.. ์•„๋‹Œ๊ฐ€ ๊ทธ๊ฒŒ ์˜คํžˆ๋ ค ์„ฑ๋Šฅ ๋” ์ข‹์•„์ง€๋Š”๋ฐ ๊ธฐ์—ฌํ–ˆ๋‚˜.. ์—ฌ๋Ÿฌ๋ชจ๋กœ ๋ญ”๊ฐ€ ablation์ด ์ž˜์•ˆ๋ผ์„œ ์–ด๋ ต๊ตฐ

Details

  • performance image

architecture

image

256์ด ๊ฐ€์žฅ ์ข‹์•˜๋‹ค๊ณ  ํ•จ image

Inputs / Outputs

image

๋ณ„๋„์˜ instruction์ด ํฌ๊ฒŒ ์•ˆ์“ฐ์˜€๊ตฐ. <ref>๋‚˜ <box>๊ฐ™์€ special token์ด ์“ฐ์˜€๊ณ  ์•ˆ์— bbox ์ขŒํ‘œ ๊ฐ™์€๊ฑด ๋”ฐ๋กœ ์ŠคํŽ˜์…œ ํ† ํฐ ์•ˆ์ผ๋‹ค๊ณ  ํ•จ

training pipeline

image
  • ๋‹ฌ๋ผ์ง€๋Š” hparam image

resolution up / seq len up

  • stage๋งˆ๋‹ค ๋‹ฌ๋ผ์ง€๋Š” ๋ฐ์ดํ„ฐ์…‹

pre-training stage

image

COYO๊ฐ€ alt-text๋ฅ˜ ์ค‘์— ๊ฐ€์žฅ ์‚ด์•„๋‚จ์€ ๋น„์œจ์ด ๋†’์€๊ฒŒ ํฅ๋ฏธ๋กญ๊ตฐ ๋”ฑ ์ด๋ฏธ์ง€ ํ•œ๋ฒˆ์”ฉ๋งŒ ๋ดค๋‹ค๊ณ  ํ•จ ใ…‹ใ…‹ ์ด filtering rule์€ ์ž์„ธํ•˜๊ฒŒ ์•ˆ์ ํ˜€์žˆ๋Š”๋ฐ appendix์—์„œ ์•„๋ž˜์™€ ๊ฐ™์ด

image

clip score๋ฅผ ์•„์ฃผ ๊ฐ•ํ•˜๊ฒŒ ๋‚จ๊ฒผ๋‹ค๊ณ  ํ•˜๋„น..

Multi-task Pre-training

image

Supervised Finetuning

์ด๊ฒƒ๋„ ์ž์„ธํžˆ ์•ˆ๋‚˜์™€ ์žˆ๋Š”๋ฐ manual annotation, model generation, ๋ฒค์น˜๋งˆํฌ ๋ฐ์ดํ„ฐ concatํ•ด๊ฐ€์ง€๊ณ  multi-turn์œผ๋กœ ๋งŒ๋“ค์—ˆ๋‹ค๊ณ  ํ•จ (์ค‘์š”ํ•œ ๊ฒƒ ๊ฐ™์€๋ฐ.. ใ…œใ…œ) image

Result

๋ฒค์น˜๋งˆํฌ ์„ฑ๋Šฅ๋“ค์€ ์ƒ๋žต

instruction following benchmark

image

Few-shot ability

image

text only benchmark

image

Qwen LM์„ ํ•™์Šต๋œ ์ค‘๊ฐ„ ๊ป„ ์ผ๋Š”๋ฐ ๋‹ค๋ฅธ ์ด์œ ๋Š” ์—†๊ณ  ๊ทธ๋ƒฅ ๋‘˜์ด ๊ฑฐ์˜ ๋™์‹œ์— ๊ฐœ๋ฐœ์ค‘์ด์—ˆ๋‹ค๊ณ  ใ…‹ใ…‹