image

paper

TL;DR

  • I read this because.. : reasoning ability in VLM
  • task : VLM
  • problem : most VLM instruction data consists of short answers
  • idea : use GPT4-o to build CoT data
  • architecture : LLaVA-NeXT
  • objective : CE loss -> DPO loss
  • baseline : LLaVA-NeXT, GPT4o, Cambrian, (data) RLAIF
  • data : ShareGPT4-o Reasoning (not yet released)
  • evaluation : A-OKVQA, DocVQA, ChartQA, AI2D, ScienceQA, …
  • result : evenly high performance across all benchmarks.
  • contribution : improves benchmark scores with a small dataset. Includes a lot of reasoning-related analysis

Details

  • motivation image

Data

  • reasoning data distillation image
image image image

Result

image

The data is composed as shown above:

  • (1) format: a minimal set, just enough for the model to learn the answer format. 50 samples drawn from each of the 9 datasets, in both CoT and direct form, plus 2K from LLaVA-pretrain
  • (2) direct data: (1) + the full 193K direct-answer samples
  • (3) CoT data : (1) + the 193K CoT samples + additionally GLLaVA-align / QA
  • (4) CoT SFT : (1) + both direct and CoT + additionally GLLaVA-align / QA
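
The four settings differ only in which components are stacked on top of the format set, so a small sketch can make the comparisons explicit. The component labels below are hypothetical names chosen for illustration (they are not identifiers from the paper or its code), and the sketch tracks only which components are present, not exact sample counts.

```python
# Hypothetical component labels, for illustration only (not from the paper's code).
FORMAT = ("format_direct_samples", "format_cot_samples", "llava_pretrain_2k")

# Each setting is the format set plus extra components, following items (1)-(4).
MIXES = {
    "format": FORMAT,
    "direct": FORMAT + ("direct_193k",),
    "cot": FORMAT + ("cot_193k", "gllava_align_qa"),
    "cot_sft": FORMAT + ("direct_193k", "cot_193k", "gllava_align_qa"),
}


def added_components(base: str, other: str) -> list[str]:
    """Return the components that `other` contains and `base` lacks, in order."""
    base_set = set(MIXES[base])
    return [c for c in MIXES[other] if c not in base_set]


if __name__ == "__main__":
    for name in ("direct", "cot", "cot_sft"):
        print(name, "adds", added_components("format", name))
```

Read this way, comparing (1) with (2) isolates the effect of direct-answer data, while comparing (3) with (4) isolates the effect of adding direct data on top of CoT data.
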
image

CAN REASONING BE IMPLICITLY LEARNT FROM DIRECT PREDICTION?

  • Comparing (1) and (2): when the model is trained on direct answers only, CoT inference improves only marginally, and in some cases gets worse (MathVista -1.7)

HOW EFFECTIVE IS COT REASONING DATA?

  • (3) Performance rose on calculation-heavy benchmarks such as ChartQA and MathVista and, surprisingly, also on text-heavy benchmarks such as TextVQA, DocVQA, and InfoVQA.
  • (4) Training on both CoT and direct data gave the best average performance. However, TextVQA, DocVQA, and AI2D scored higher with direct prediction, presumably because these benchmarks mostly test fact extraction.

ABLATION TESTS ON DATA COMPOSITION

image

Data ablation on the math side. Text-only SFT reportedly had little effect, so it was removed.

image

Ablation on the science side. Using both together helped each of them.

Comparison with GPT4o / Cambrian

image

On ScienceQA the closed set scores well. This may be a training-data issue.

DPO Result

image image image
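
The TL;DR lists the objective as CE loss followed by DPO loss. As a reminder of what the second stage optimizes, here is a minimal sketch of the standard DPO loss for a single preference pair. It is the general formulation, not necessarily the paper's exact setup: how the chosen/rejected pairs are built is not covered in these notes, and `beta=0.1` is only an assumed, commonly used default.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are sequence-level log-probabilities of the chosen and rejected
    responses under the trained policy and the frozen reference model.
    beta is an assumed default, not a value taken from the paper.
    """
    # Implicit reward margin: how much more strongly the policy prefers the
    # chosen response over the rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # loss = -log(sigmoid(beta * margin)) = softplus(-beta * margin),
    # computed in a numerically stable form.
    x = -beta * margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss is log(2).
    print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
    # Policy favors the chosen response more than the reference does: lower loss.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss falls below log(2) only when the policy separates chosen from rejected more than the reference model does, which is why the SFT model is normally used as the reference.
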

There is more in the paper beyond this, such as BoN, which I will write up later.