image

paper, code

TL;DR

  • I read this because.. : to improve MathVista performance
  • task : LVLM
  • problem : the existing math LVLM works, G-LLaVA and Math-LLaVA, are limited: G-LLaVA to geometric reasoning, and Math-LLaVA in its CoT ability
  • idea : build a dataset that covers diverse math domains and adds CoT
  • architecture : llava (clip-vit-large, DeepSeekMath-RL)
  • objective : ce loss + ppo loss
  • baseline : closed LLMs, LLMs, Math LLMs, Open-Source MLLMs (G-LLaVA-7B, Math-LLaVA-13B, LLaVA-1.5-7B, LLaVA-NeXT-34B)
  • data : (align) LLaVA-Pretrain + geo170k-align (instruct) LLaVA-instruct (math instruct) MultiMath300k-instruction, Geo170k-qa, MathV360k (PPO) MultiMath300K-val, GSM8K-train, Math-train, CMATH-train
  • evaluation : Mathvista, Mathverse, GSM8K, MATH, CMATH, GaoKao
  • result : highest MathVista and MathVerse scores among open-source models; also SOTA on text math benchmarks when compared with other MLLMs.
  • contribution : proposes a dataset, and achieves strong performance on both text and vision
  • etc. : the content may be predictable, but there was a lot of analysis, so it was a fun read

Details

Thumbnail

image

proposed MultiMath-300K

image image

  • Built by purchasing the image licenses directly (http://test.xuekubao.com/ )
  • Not only QA; some samples also come with captions
  • Covers geometry problem solving, automatic theorem proving, and mathematical word problems
  • Said to be English/Chinese, but it looks like it is almost all Chinese..?
  • Covers CoT

image

Collection method

image

  • round 1: use GPT-4o to generate step-by-step reasoning chains, with the original data given as a hint
  • round 2: use GPT-4o to judge whether the generated reasoning chain is sound when compared against the standard answer. If it is inconsistent, revise the reasoning steps
  • round 3: compare the GPT-4o answer with the standard answer and keep only the samples whose answer is correct.
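
The three rounds above can be sketched as a small filtering pipeline. This is only an illustration under assumptions: `Sample`, `CoTSample`, `build_cot_dataset`, `normalize`, and the three callables (`generate`, `review`, `extract_answer`) are hypothetical names standing in for the GPT-4o calls, and exact string matching in round 3 is a simplification, not something these notes or the paper specify.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Sample:
    question: str
    standard_answer: str    # answer shipped with the original problem
    original_solution: str  # original analysis text, used only as a hint


@dataclass
class CoTSample:
    question: str
    reasoning: str
    final_answer: str


def normalize(answer: str) -> str:
    """Crude answer normalization (assumption): trim and lowercase."""
    return answer.strip().lower()


def build_cot_dataset(
    samples: Iterable[Sample],
    generate: Callable[[str, str], str],     # (question, hint) -> reasoning chain
    review: Callable[[str, str, str], str],  # (question, reasoning, standard answer) -> reasoning, revised if inconsistent
    extract_answer: Callable[[str], str],    # reasoning chain -> final answer
) -> List[CoTSample]:
    kept: List[CoTSample] = []
    for s in samples:
        # Round 1: step-by-step reasoning, with the original data as a hint.
        reasoning = generate(s.question, s.original_solution)
        # Round 2: check against the standard answer; the reviewer may revise the steps.
        reasoning = review(s.question, reasoning, s.standard_answer)
        # Round 3: keep the sample only if the final answer matches the standard answer.
        answer = extract_answer(reasoning)
        if normalize(answer) == normalize(s.standard_answer):
            kept.append(CoTSample(s.question, reasoning, answer))
    return kept
```

In practice the three callables would wrap model calls; passing them in as plain functions keeps the filtering logic testable with stubs.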

training

  • (align) LLaVA-Pretrain + geo170k-align : 1 epoch
  • (instruct) LLaVA-instruct : the ViT is fully tuned as well
  • (math instruct) MultiMath300k-instruction, Geo170k-qa, MathV360k
  • (PPO) built using MultiMath300K-val, GSM8K-train, Math-train, and CMATH-train as sources
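
For reference, the stage order above can be written down as a small config structure. This is a sketch for note-keeping only: `Stage`, `STAGES`, and `describe` are hypothetical names, and nothing here reflects the authors' actual training scripts or hyperparameters beyond what the bullets state.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    datasets: Tuple[str, ...]
    note: str = ""  # free-form remark taken from the notes, empty if none


# Stage order and dataset names as listed in the notes above.
STAGES = (
    Stage("align", ("LLaVA-Pretrain", "geo170k-align"), "1 epoch"),
    Stage("instruct", ("LLaVA-instruct",), "ViT fully tuned as well"),
    Stage("math instruct", ("MultiMath300k-instruction", "Geo170k-qa", "MathV360k")),
    Stage("PPO", ("MultiMath300K-val", "GSM8K-train", "Math-train", "CMATH-train"), "used as sources"),
)


def describe(stages: Sequence[Stage]) -> List[str]:
    """Return one human-readable line per stage, in training order."""
    lines = []
    for i, s in enumerate(stages, 1):
        suffix = f" ({s.note})" if s.note else ""
        lines.append(f"{i}. {s.name}: {', '.join(s.datasets)}{suffix}")
    return lines
```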

Process-supervised RL

image

  • Have the model do CoT reasoning so that it generates multiple reasoning steps
  • Have GPT-4o judge correctness, locate the step where the error occurred, and generate a correct solution
  • This yields preferred / dispreferred sets -> train the RM, then PPO
  • PPO is trained with the reward scores for the reasoning steps generated by the actor model
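
A minimal sketch of how the preferred / dispreferred pairs described above might be assembled from a judge's output. The names `Judgement`, `PreferencePair`, and `make_pair` are hypothetical, and the assumption that the corrected solution reuses the correct prefix before the first wrong step is mine, not something the notes confirm.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Judgement:
    first_error_step: Optional[int]  # 0-based index of the first wrong step; None if all steps are correct
    corrected_steps: List[str]       # judge-written steps replacing everything from the error onward


@dataclass
class PreferencePair:
    prompt: str
    preferred: List[str]     # correct prefix + corrected continuation
    dispreferred: List[str]  # the actor's original, erroneous chain


def make_pair(prompt: str, steps: List[str], judgement: Judgement) -> Optional[PreferencePair]:
    """Build one preference pair, or return None when the chain has no error."""
    k = judgement.first_error_step
    if k is None:
        return None  # nothing to contrast against
    if not 0 <= k < len(steps):
        raise ValueError("first_error_step is out of range")
    # Assumption: steps before the first error are kept verbatim in the preferred chain.
    preferred = steps[:k] + list(judgement.corrected_steps)
    return PreferencePair(prompt, preferred, list(steps))
```

Pairs built this way could then feed reward-model training; how the per-step reward scores are produced during PPO is not detailed in these notes.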

Result

image

Not as strong as the closed models, but the highest performance among open-source models.

Text performance

image

The other open-source math-specialized models do worse than LLaVA-NeXT.

contribution of RL

image

CMATH, GSM8K, and MATH, the domains used in the PPO stage, improved; the ones not used in PPO did not. MathVista went up by 0.8 (the align and SFT stages raised it by 1.3 and 1.6, respectively), while MathVerse dropped by 0.2.

LLM backbone

image

The gap versus Vicuna is large: MathVista 42.9 vs 50.0, which is striking. The fact that MultiMath is mostly Chinese is probably part of the reason. Still, judging from Table 3, it is not that training failed.