image

paper, code, dataset

TL;DR

  • I read this because.. : it was recommended to me
  • task : reasoning in LVLM
  • problem : we want LVLMs to reason at length the way gpt-o1 does
  • idea : build a dataset and train on it. Split the answer into stages. Run beam search per answer stage
  • architecture : Llama 3.2V
  • objective : CE loss (further SFT after SFT)
  • baseline : Llama 3.2V
  • data : Llava-CoT-100k (proposed)
  • evaluation : mmstar, mmbench, mmvet, mathvista, ai2d,
  • result : improved performance.
  • contribution : dataset release.

Details

  • thumbnail image

  • inference examples image

  • how the answer is structured image

Responses are generated with GPT4o, then those that do not follow the structure are filtered out. The content inside the <summary>, </summary> tags is compared with the GT answer, and GPT4o is used once more to filter on whether the answer is correct image
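The two filtering steps above can be sketched roughly as follows. This is only an illustration, not the authors' pipeline: the stage names other than `summary` are placeholders, `extract_stages` and `keep_sample` are hypothetical helpers, and `judge` is a hypothetical callable standing in for the GPT4o comparison against the GT answer.

```python
import re

# Assumption: only <summary> is named in these notes; the other stage
# names are placeholders for illustration.
STAGES = ("summary", "caption", "reasoning", "conclusion")


def extract_stages(text, stages=STAGES):
    """Return {stage: content} if each tag pair appears exactly once, in order.

    Returns None when the response does not follow the structure.
    """
    parsed = {}
    pos = 0
    for name in stages:
        pattern = re.compile(rf"<{name}>(.*?)</{name}>", re.DOTALL | re.IGNORECASE)
        matches = list(pattern.finditer(text))
        if len(matches) != 1:
            return None  # missing or duplicated stage
        match = matches[0]
        if match.start() < pos:
            return None  # stage appears out of order
        parsed[name] = match.group(1).strip()
        pos = match.end()
    return parsed


def keep_sample(text, gt_answer, judge, stages=STAGES, answer_stage="summary"):
    """Keep a generated sample only if it is well-formed and judged correct.

    `judge(predicted, gt_answer)` is a hypothetical stand-in for the
    second GPT4o call that checks the answer against the ground truth.
    """
    parsed = extract_stages(text, stages)
    if parsed is None:
        return False
    return bool(judge(parsed[answer_stage], gt_answer))
```

With a trivial `judge` such as `lambda pred, gt: gt in pred`, a well-formed response whose checked stage contains the GT answer is kept, and anything malformed is dropped before the judge is ever called.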

image
  • image sources used for generation image

The sources overlap with https://github.com/long8v/PTIR/issues/203 image

  • beam search performed for each stage of the structure image

Because it is called "beam search" I did not realize it at first, but it appears to rely on an external verifier. What prompt is used at this step? I could not find which model was used image
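A minimal sketch of how per-stage selection with a verifier could look, under heavy assumptions. `stage_level_search`, `propose`, and `prefer` are hypothetical names: `propose` stands in for sampling one candidate for a stage from the model, and `prefer` stands in for whatever verifier decides between two candidates. As noted above, the actual verifier model and prompt were not identified.

```python
import random


def stage_level_search(stages, propose, prefer, n_candidates=2, seed=0):
    """Pick one candidate per stage, conditioning later stages on earlier picks.

    propose(stage, prefix, rng) -> candidate for this stage (hypothetical sampler)
    prefer(stage, prefix, a, b) -> True if candidate a beats b (hypothetical verifier)
    """
    rng = random.Random(seed)
    prefix = []  # winning candidates of the stages finished so far
    for stage in stages:
        # Sample several candidates for the current stage only.
        candidates = [propose(stage, prefix, rng) for _ in range(n_candidates)]
        # Keep a running winner through pairwise comparisons.
        best = candidates[0]
        for challenger in candidates[1:]:
            if prefer(stage, prefix, challenger, best):
                best = challenger
        prefix.append(best)
    return prefix
```

The difference from sentence-level or whole-response selection is the granularity: candidates are compared once per stage, and only the winner is carried forward as context for the next stage.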

  • Training hparam image

Result

image

They select what they consider to be "reasoning benchmarks". Direct training is further SFT on the original VQA set. w/o structured tag means tags such as <summary> are not used. mmstar, mmvet, and mathvista improve. On ai2d, simply training on the answers directly performs better

image

In the mmstar breakdown, the reasoning-related subcategories and math, science, etc. go up. Perception does improve, but only marginally.

  • stage level beam search image

There is no mention of training an RM, so how was BoN done? image

  • comparison with other models image