TL;DR
- I read this because: Is it bad to do too much SFT? Also a follow-up study by the RL4VLM authors
- task : card game (GeneralPoints), real-world navigation (V-IRL)
- Problem : Analysis of the memorization phenomenon in SFT vs. RL
- Idea : Create out-of-distribution variants with small changes to the rules or environment, and analyze how performance changes
- input/output : {prompt, (image), previous predictions and results, etc.} -> verifier output
- architecture : Llama-3.2-Vision-11B
- objective : SFT loss -> PPO loss
- baseline : base model; (V-IRL) ChatGPT, Claude, etc.
- data : (SFT) expert data appears to be used
- evaluation : success rate
- result : 1) in-domain, SFT > RL; SFT degrades OOD, while RL maintains or improves 2) SFT should be used for instruction following 3) sequential revision input affects performance 4) achieves SOTA on V-IRL
- contribution : Systematically breaks the problem down into tasks that are not too complex and are easy to understand
- etc. : Thanks for doing VLM too 🙏
Details
task
GeneralPoints (a game that makes 24 from 4 cards using arithmetic operations) : LLM / VLM
- OOD
Treating J, Q, K as 10 vs. as 11, 12, 13
Sampling from black cards / sampling from red cards
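The GeneralPoints objective and the rule variant above can be sketched as follows (my own illustration with hypothetical helper names, not code from the paper): a brute-force check that the four card values can reach 24, plus the in-distribution vs. OOD face-card mappings.

```python
# In-distribution rule: J, Q, K all count as 10.
FACE_ID = {"J": 10, "Q": 10, "K": 10}
# OOD rule variant: J, Q, K count as 11, 12, 13.
FACE_OOD = {"J": 11, "Q": 12, "K": 13}

def can_make(values, target=24, eps=1e-6):
    """Brute force: repeatedly combine any two values with +, -, *, /
    until one number remains; True if any combination hits the target."""
    if len(values) == 1:
        return abs(values[0] - target) < eps
    for i in range(len(values)):
        for j in range(len(values)):
            if i == j:
                continue
            rest = [values[k] for k in range(len(values)) if k not in (i, j)]
            a, b = values[i], values[j]
            results = [a + b, a - b, a * b]
            if abs(b) > eps:  # skip division by (near-)zero
                results.append(a / b)
            if any(can_make(rest + [r], target, eps) for r in results):
                return True
    return False
```

For example, `can_make([5, 5, 5, 1])` is true via 5 * (5 - 1/5) = 24, and the same hand can flip between solvable and unsolvable depending on which face-card mapping is in effect — which is exactly what makes the rule change a clean OOD probe.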
- V-IRL
The task of navigating around a city
- OOD :
The action space changes (e.g., to relative directions like "turn left")
Replace the city
sequential revision input (previous predictions and verifier feedback are fed back in)
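One way the sequential revision input might be assembled (the prompt template here is my guess, not the paper's exact wording): each new attempt sees the task prompt plus all prior predictions and their verifier feedback.

```python
def build_revision_prompt(task_prompt, history):
    """history: list of (prediction, verifier_feedback) pairs from
    earlier turns; the model sees all of them on the next attempt."""
    lines = [task_prompt]
    for t, (pred, feedback) in enumerate(history, start=1):
        lines.append(f"Attempt {t}: {pred}")
        lines.append(f"Verifier: {feedback}")
    lines.append("Give a revised answer.")
    return "\n".join(lines)
```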
training
- SFT -> RL
- RL uses PPO
- No reasoning; the model just returns the answer directly
- the verifier appears to be rule-based
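A rule-based verifier for GeneralPoints could look like the following sketch (my own implementation; the paper's actual checks may differ): parse the proposed equation, confirm it uses exactly the dealt card values, and check that it evaluates to 24.

```python
import ast
import operator

# Allowed binary operators for the card-game expressions.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def verify(equation, cards, target=24):
    """Return a short rule-based verdict string for a proposed equation."""
    try:
        tree = ast.parse(equation, mode="eval")
    except SyntaxError:
        return "invalid syntax"
    used = []  # card values encountered while evaluating

    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            used.append(node.value)
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("disallowed expression")

    try:
        value = ev(tree)
    except (ValueError, ZeroDivisionError):
        return "illegal expression"
    if sorted(used) != sorted(cards):
        return "wrong card values used"
    if abs(value - target) > 1e-6:
        return f"evaluates to {value}, not {target}"
    return "correct"
```

These verdict strings are exactly the kind of feedback that would be appended to the prompt under sequential revision.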
result
- ood performance
As training progresses, RL's OOD performance increases while SFT's deteriorates significantly, maintaining nothing
- result for visual OOD
SFT is necessary for RL training when the backbone model does not follow instructions.
Scaling up verification improves generalization.
+2.15% (3 steps), +2.99% (5 steps), +5.99% (10 steps); with a single verification step, only a marginal +0.48% OOD improvement is observed.