
paper, page

TL;DR

  • I read this because : Is too much SFT harmful? + a follow-up study by the RL4VLM authors
  • task : card game (GeneralPoints), real-world navigation (V-IRL)
  • Problem : analysis of the data-memorization phenomenon in SFT vs. RL
  • Idea : create out-of-distribution variants via small changes to the rules or environment, and analyze how performance changes
  • input/output : {prompt, (image), previous prediction and result, ...} -> verifier output
  • architecture : Llama-3.2-Vision-11B
  • objective : SFT loss -> PPO loss
  • baseline : base model, (V-IRL) ChatGPT, Claude, etc.
  • data : (SFT) expert data appears to be used
  • evaluation : success rate
  • result : 1) in-domain, SFT > RL; on OOD, SFT degrades sharply while RL is maintained or improved 2) SFT should be used for instruction following 3) adding sequential revision affects performance 4) V-IRL achieves SOTA
  • contribution : systematic breakdown into tasks that are not too complex and are easy to understand
  • etc. : Thanks for doing VLM too 🙏

Details

  • thumbnail
Image

task

  • GeneralPoints (a game that uses 4 cards to make 24 through arithmetic operations) : LLM / VLM

    • Image
    • OOD
      • viewing J, Q, K as 10 vs. as 11, 12, 13
      • sampling from black cards vs. sampling from red cards
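The GeneralPoints objective (reach 24 from 4 card values via +, -, *, /) can be checked by brute force over operand orders and parenthesizations; a minimal sketch, not from the paper:

```python
from itertools import permutations, product

def can_make_24(cards, target=24):
    """Brute-force check: can the 4 card values reach the target
    using +, -, *, / and any parenthesization? Returns a solving
    expression, or None if no solution exists."""
    ops = "+-*/"
    for a, b, c, d in permutations(cards):
        for o1, o2, o3 in product(ops, repeat=3):
            # the five distinct parenthesizations of four operands
            exprs = [
                f"(({a}{o1}{b}){o2}{c}){o3}{d}",
                f"({a}{o1}({b}{o2}{c})){o3}{d}",
                f"({a}{o1}{b}){o2}({c}{o3}{d})",
                f"{a}{o1}(({b}{o2}{c}){o3}{d})",
                f"{a}{o1}({b}{o2}({c}{o3}{d}))",
            ]
            for e in exprs:
                try:
                    if abs(eval(e) - target) < 1e-6:
                        return e
                except ZeroDivisionError:
                    pass
    return None
```

Under the "J, Q, K as 11, 12, 13" rule variant, the same routine applies with the remapped card values.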

  • V-IRL (the task of navigating around a city)

    • Image
    • OOD
      • the action space changes (to Turn left, etc.)
      • replacing the city

  • sequential revision : previous predictions and verifier feedback are appended to the input (Image)

training

  • training pipeline : SFT -> RL
  • RL uses PPO
  • no reasoning; the model just returns the answer directly
  • the verifier appears to be rule-based
    • Image
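A minimal sketch of what a rule-based verifier for GeneralPoints could look like (the function name and 0/1 reward scheme are my assumptions, not the paper's code):

```python
import re

def verify_answer(expr, cards, target=24):
    """Hypothetical rule-based verifier: reward 1.0 only if the
    proposed expression uses exactly the dealt card values and
    evaluates to the target; 0.0 otherwise."""
    # reject anything but digits, whitespace, operators, parentheses
    if not re.fullmatch(r"[\d\s()+\-*/]+", expr):
        return 0.0
    # the numbers used must match the dealt cards as a multiset
    used = sorted(int(n) for n in re.findall(r"\d+", expr))
    if used != sorted(cards):
        return 0.0
    try:
        value = eval(expr)  # input already whitelisted above
    except ZeroDivisionError:
        return 0.0
    return 1.0 if abs(value - target) < 1e-6 else 0.0
```

Such a binary verifier signal can serve both as the PPO reward and as the pass/fail feedback in the sequential-revision loop.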

result

  • OOD performance (Image)

As training progresses, RL's OOD performance increases while SFT's deteriorates significantly, with nothing maintained; on OOD, RL > SFT.

Image
  • results for visual OOD
Image
  • SFT is necessary for RL training when the backbone model does not follow instructions. Image

  • Scaling up verification improves generalization.

Image

+2.15% (3 steps), +2.99% (5 steps), +5.99% (10 steps) in OOD performance, versus only a marginal +0.48% improvement with a single verification step.
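The scaled-up verification above can be sketched as a simple retry loop: each failed attempt and its verifier feedback are appended to the context before the next try (hypothetical `model`/`verifier` interfaces, not the paper's implementation):

```python
def sequential_revision(model, verifier, prompt, max_steps=5):
    """Hypothetical sketch of the sequential-revision loop: the
    model's previous answers and the verifier's feedback are fed
    back in, and the model retries until success or the budget."""
    history = []
    for step in range(max_steps):
        answer = model(prompt, history)       # condition on past attempts
        reward = verifier(answer)             # rule-based 0/1 check
        if reward > 0:
            return answer, step + 1           # solved in step+1 tries
        history.append((answer, "incorrect")) # feed failure back in
    return None, max_steps
```

More verification steps give the policy more chances to correct itself, matching the trend that OOD gains grow with the step budget.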