TL;DR
- I read this because: Is it bad to do too much SFT? Also a follow-up study by the RL4VLM authors
- task : card game (GeneralPoints), real-world navigation (V-IRL)
- Problem : Analysis of the memorization phenomenon in SFT vs. RL
- Idea : Create out-of-distribution variants with small changes to the rules or environment, and analyze how performance changes
- input/output : {prompt, (image), previous predictions and results, etc.} -> verifier output
- architecture : Llama-3.2-Vision-11B
- objective : SFT loss -> PPO loss
- baseline : base model; (V-IRL) ChatGPT, Claude, etc.
- data : (SFT) expert data appears to be used
- evaluation : success rate
- result : 1) in-domain, SFT > RL; SFT degrades OOD, while RL maintains or improves 2) SFT should be used for instruction following 3) sequential revision input affects performance 4) achieves SOTA on V-IRL
- contribution : Systematically breaks the problem down into tasks that are not too complex and are easy to understand
- etc. : Thanks for doing VLM too 🙏
Details
task
GeneralPoints (a game that makes 24 from 4 cards using arithmetic operations) : LLM / VLM
- OOD
Treating J, Q, K as 10 vs. as 11, 12, 13
Sampling from black cards / sampling from red cards
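The GeneralPoints objective and the rule variant above can be sketched as follows (my own illustration with hypothetical helper names, not code from the paper): a brute-force check that the four card values can reach 24, plus the in-distribution vs. OOD face-card mappings.

```python
# In-distribution rule: J, Q, K all count as 10.
FACE_ID = {"J": 10, "Q": 10, "K": 10}
# OOD rule variant: J, Q, K count as 11, 12, 13.
FACE_OOD = {"J": 11, "Q": 12, "K": 13}

def can_make(values, target=24, eps=1e-6):
    """Brute force: repeatedly combine any two values with +, -, *, /
    until one number remains; True if any combination hits the target."""
    if len(values) == 1:
        return abs(values[0] - target) < eps
    for i in range(len(values)):
        for j in range(len(values)):
            if i == j:
                continue
            rest = [values[k] for k in range(len(values)) if k not in (i, j)]
            a, b = values[i], values[j]
            results = [a + b, a - b, a * b]
            if abs(b) > eps:  # skip division by (near-)zero
                results.append(a / b)
            if any(can_make(rest + [r], target, eps) for r in results):
                return True
    return False
```

For example, `can_make([5, 5, 5, 1])` is true via 5 * (5 - 1/5) = 24, and the same hand can flip between solvable and unsolvable depending on which face-card mapping is in effect — which is exactly what makes the rule change a clean OOD probe.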
- V-IRL
The task of navigating around a city
- OOD :
The action space changes (e.g., to relative directions like "turn left")
Replace the city
sequential revision input (previous predictions and verifier feedback are fed back in)
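One way the sequential revision input might be assembled (the prompt template here is my guess, not the paper's exact wording): each new attempt sees the task prompt plus all prior predictions and their verifier feedback.

```python
def build_revision_prompt(task_prompt, history):
    """history: list of (prediction, verifier_feedback) pairs from
    earlier turns; the model sees all of them on the next attempt."""
    lines = [task_prompt]
    for t, (pred, feedback) in enumerate(history, start=1):
        lines.append(f"Attempt {t}: {pred}")
        lines.append(f"Verifier: {feedback}")
    lines.append("Give a revised answer.")
    return "\n".join(lines)
```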
training
- SFT -> RL
- RL uses PPO
- No reasoning; the model just returns the answer directly
- the verifier appears to be rule-based
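A rule-based verifier for GeneralPoints could look like the following sketch (my own implementation; the paper's actual checks may differ): parse the proposed equation, confirm it uses exactly the dealt card values, and check that it evaluates to 24.

```python
import ast
import operator

# Allowed binary operators for the card-game expressions.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def verify(equation, cards, target=24):
    """Return a short rule-based verdict string for a proposed equation."""
    try:
        tree = ast.parse(equation, mode="eval")
    except SyntaxError:
        return "invalid syntax"
    used = []  # card values encountered while evaluating

    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            used.append(node.value)
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("disallowed expression")

    try:
        value = ev(tree)
    except (ValueError, ZeroDivisionError):
        return "illegal expression"
    if sorted(used) != sorted(cards):
        return "wrong card values used"
    if abs(value - target) > 1e-6:
        return f"evaluates to {value}, not {target}"
    return "correct"
```

These verdict strings are exactly the kind of feedback that would be appended to the prompt under sequential revision.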
result
- ood performance
As training progresses, RL's OOD performance increases while SFT's deteriorates significantly, maintaining nothing
- result for visual OOD
SFT is necessary for RL training when the backbone model does not follow instructions.
Scaling up verification improves generalization.
+2.15% (3 steps), +2.99% (5 steps), +5.99% (10 steps); with a single verification step, only a marginal +0.48% OOD improvement is observed.