Image

paper , page

TL;DR

  • I read this because.. : Is doing too much SFT harmful? + follow-up work from the RL4VLM authors
  • task : card game (GeneralPoints), real-world navigation (V-IRL)
  • problem : analysis of data memorization in SFT vs RL
  • idea : build out-of-distribution variants by slightly changing the rules or the environment, then analyze how performance changes
  • input/output : {prompt, (image), previous prediction and result..} -> verifier output
  • architecture : Llama-3.2-Vision-11B
  • objective : SFT loss -> PPO loss
  • baseline : base model, (V-IRL) chatgpt, claude..
  • data : (SFT) there appears to be expert data
  • evaluation : success rate
  • result : 1) In-domain, SFT > RL. With SFT, OOD performance drops, whereas with RL it is maintained or improves. 2) SFT is still needed so that the model follows instructions. 3) Feeding inputs as sequential revisions affects performance. 4) Achieves SOTA on V-IRL.
  • contribution : systematic analysis using tasks that are not overly complex and are easy to understand
  • etc. : grateful that they covered VLMs too

Details

  • thumbnail
Image

task

  • GeneralPoints (a game of making 24 from 4 cards using the four arithmetic operations) : LLM / VLM

    • Image
    • OOD
      • Treating J, Q, K as 10 vs. as 11, 12, 13
      • Sampling from black cards / sampling from red cards
  • V-IRL

    • Image
    • A navigation task carried out while moving around a city
    • OOD :
      • The actions change, e.g. to "turn left".
      • The city is changed.
  • sequential revision input Image
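
To make the GeneralPoints rule shift concrete, here is a minimal sketch of a rule-based check under the two face-card interpretations listed above. It is not the paper's code: the names `FACE_AS_TEN`, `FACE_AS_ORDER`, `card_value`, and `verify`, the treatment of "A" as 1, and the requirement that each card be used exactly once are all assumptions made for illustration.

```python
import ast
import operator
from collections import Counter
from fractions import Fraction

# Two hypothetical rule variants: face cards all count as 10, or as 11/12/13.
FACE_AS_TEN = {"J": 10, "Q": 10, "K": 10}
FACE_AS_ORDER = {"J": 11, "Q": 12, "K": 13}

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def card_value(card, face_rule):
    """Map a card label to a number (assumption: "A" counts as 1)."""
    if card == "A":
        return 1
    if card in face_rule:
        return face_rule[card]
    return int(card)


def _eval(node, used):
    """Evaluate an arithmetic AST exactly, recording every integer literal used."""
    if (isinstance(node, ast.Constant) and isinstance(node.value, int)
            and not isinstance(node.value, bool)):
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _eval(node.left, used)
        right = _eval(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def verify(cards, expression, face_rule, target=24):
    """Return True if the expression uses each card value once and equals target."""
    numbers = [card_value(c, face_rule) for c in cards]
    used = []
    try:
        tree = ast.parse(expression, mode="eval")
        value = _eval(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return Counter(used) == Counter(numbers) and value == target
```

The same answer string can flip from correct to incorrect when only the rule changes, which is the kind of shift the OOD split probes: for cards `["Q", "2", "A", "A"]`, the expression `12*2*1*1` passes under `FACE_AS_ORDER` but fails under `FACE_AS_TEN`.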

training

  • SFT -> RL
  • RL์€ PPO
  • There is no separate reasoning step; the model returns the answer directly.
  • The verifier appears to be rule-based.
    • Image
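
The sequential revision setup (prompt plus previous predictions and verifier results fed back as input) can be sketched roughly as below. This is an illustrative loop under stated assumptions, not the authors' implementation: `sequential_revision`, `toy_policy`, `toy_verifier`, and the text format of the feedback are all hypothetical, and the real policy is a trained model rather than a lookup.

```python
def sequential_revision(prompt, policy, verifier, max_steps):
    """Query the policy up to max_steps times, appending verifier feedback each time."""
    history = []  # list of (answer, feedback) pairs from earlier attempts
    for _ in range(max_steps):
        # Input = task prompt + every previous answer with its verifier result.
        parts = [prompt]
        for answer, feedback in history:
            parts.append(f"Previous answer: {answer}\nVerifier: {feedback}")
        model_input = "\n\n".join(parts)

        answer = policy(model_input)
        ok, feedback = verifier(answer)
        history.append((answer, feedback))
        if ok:
            return True, history
    return False, history


def toy_policy(model_input):
    """Stand-in for the model: gives a different answer after each failed attempt."""
    attempts = model_input.count("Previous answer:")
    return ["20", "22", "24"][min(attempts, 2)]


def toy_verifier(answer):
    """Stand-in rule-based verifier: the answer is correct only if it is "24"."""
    ok = answer == "24"
    feedback = "correct" if ok else "incorrect"
    return ok, feedback
```

With the toy stand-ins, `sequential_revision("Make 24.", toy_policy, toy_verifier, max_steps=1)` fails after a single attempt, while `max_steps=3` gives the policy enough verifier feedback to reach the correct answer. The number of verification steps is the knob that the scaling result in the notes varies.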

result

  • ood performance Image

ํ•™์Šต์ด ์ง„ํ–‰๋จ์— ๋”ฐ๋ผ OOD ์„ฑ๋Šฅ์ด RL > SFT SFT๋Š” ์œ ์ง€๋˜๋Š” ๊ฒƒ ์—†์ด ํฌ๊ฒŒ ์•…ํ™”๋จ

Image
  • visual OOD์— ๋Œ€ํ•œ result
Image
  • SFT is necessary for RL training when the backbone model does not follow instructions. Image

  • Scaling up verification improves generalization.

Image

OOD improvement: +2.15% (3 steps), +2.99% (5 steps), +5.99% (10 steps). In contrast, with one verification step, only a marginal OOD improvement of +0.48% is observed.