[209] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

paper , page

TL;DR

I read this because.. : is doing too much SFT harmful? + follow-up work from the RL4VLM authors
task : card game (GeneralPoints), real-world navigation (V-IRL)
problem : analysis of data memorization in SFT vs RL
idea : build out-of-distribution variants by slightly changing the rules or the environment, then analyze how performance changes
input/output : {prompt, (image), previous predictions and their results..} -> verifier output
architecture : Llama-3.2-Vision-11B
objective : SFT loss -> PPO loss
baseline : base model, (V-IRL) chatgpt, claude..
data : (SFT) there appears to be expert data
evaluation : success rate
result : 1) in-domain, SFT > RL; on OOD, SFT degrades while RL holds steady or improves 2) SFT is still needed so that the model follows instructions 3) feeding the input as sequential revisions affects performance 4) reaches SOTA on V-IRL
contribution : systematic analysis on tasks that are not overly complex and are easy to understand
etc. : grateful that they covered VLMs too

Details

task

GeneralPoints (a game where four cards are combined with the four arithmetic operations to make 24) : LLM / VLM
- OOD
  - J, Q, K all count as 10 vs. count as 11, 12, 13
  - sample from black cards / sample from red cards
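
To make the rule-based OOD split concrete, here is a minimal sketch of a GeneralPoints-style checker that scores the same answer under either face-card rule. The names `FACE_AS_TEN`, `FACE_AS_RANK`, `card_values`, and `verify`, the feedback strings, and the choice of counting an ace as 1 are all assumptions for illustration, not the paper's code.

```python
import ast
import operator
from collections import Counter
from fractions import Fraction

# Two face-card rules from the notes: all face cards count as 10,
# or J, Q, K count as 11, 12, 13. (Dictionary names are hypothetical.)
FACE_AS_TEN = {"J": 10, "Q": 10, "K": 10}
FACE_AS_RANK = {"J": 11, "Q": 12, "K": 13}

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def card_values(cards, face_rule):
    """Map labels such as 'A', '7', 'K' to integers (assumption: ace = 1)."""
    values = []
    for card in cards:
        if card == "A":
            values.append(1)
        elif card in face_rule:
            values.append(face_rule[card])
        else:
            values.append(int(card))
    return values


def _evaluate(node, used):
    """Evaluate an arithmetic AST exactly, recording every integer it uses."""
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)
    raise ValueError("only integers and + - * / are allowed")


def verify(expression, cards, face_rule, target=24):
    """Return (success, feedback) for a proposed equation."""
    used = []
    try:
        tree = ast.parse(expression, mode="eval")
        result = _evaluate(tree.body, used)
    except (SyntaxError, ValueError):
        return False, "illegal expression"
    except ZeroDivisionError:
        return False, "division by zero"
    # Each card must be used exactly once, under the active face-card rule.
    if Counter(used) != Counter(card_values(cards, face_rule)):
        return False, "numbers do not match the cards"
    if result != target:
        return False, f"result is {result}, not {target}"
    return True, "correct"
```

With the hand `["J", "4", "2", "A"]`, the answer `10*2+4*1` passes under `FACE_AS_TEN` but is rejected under `FACE_AS_RANK`, where `(11+1)*(4-2)` is a valid answer instead. That is the kind of shift a model that memorized one rule would fail on.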
V-IRL
- a task of navigating around a city
- OOD :
  - the actions change, e.g. to "turn left"
  - the city is changed
sequential revision input
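
The sequential revision input can be sketched as a loop that appends each previous answer and its verifier feedback to the next prompt. This is only an illustration of the idea in the notes: `build_prompt`, `sequential_revision`, the prompt wording, and the toy `scripted_policy` and `toy_verifier` are all hypothetical stand-ins, not the paper's prompt format or model.

```python
from typing import Callable, List, Tuple

Policy = Callable[[str], str]                  # prompt -> answer
Verifier = Callable[[str], Tuple[bool, str]]   # answer -> (success, feedback)
History = List[Tuple[str, str]]                # (answer, feedback) pairs


def build_prompt(task: str, history: History) -> str:
    """Concatenate the task with all earlier answers and verifier feedback."""
    lines = [task]
    for answer, feedback in history:
        lines.append(f"Previous answer: {answer}")
        lines.append(f"Verifier feedback: {feedback}")
    lines.append("Answer:")
    return "\n".join(lines)


def sequential_revision(task: str, policy: Policy, verifier: Verifier,
                        max_steps: int) -> Tuple[bool, History]:
    """Let the policy revise its answer for up to max_steps verification steps."""
    history: History = []
    for _ in range(max_steps):
        answer = policy(build_prompt(task, history))
        ok, feedback = verifier(answer)
        history.append((answer, feedback))
        if ok:
            return True, history
    return False, history


def scripted_policy(prompt: str) -> str:
    """Toy stand-in for a model: tries a fixed list of candidates in order."""
    candidates = ["6*3", "6*4"]
    attempt = prompt.count("Previous answer:")
    return candidates[min(attempt, len(candidates) - 1)]


def toy_verifier(answer: str) -> Tuple[bool, str]:
    """Toy rule-based check: the product of two integers must equal 24."""
    left, right = answer.split("*")
    value = int(left) * int(right)
    if value == 24:
        return True, "correct"
    return False, f"result is {value}, not 24"
```

Calling `sequential_revision("Make 24.", scripted_policy, toy_verifier, max_steps=1)` fails on the first guess, while `max_steps=3` lets the toy policy recover on its second attempt. `max_steps` plays the role of the number of verification steps discussed in the results.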

training

SFT -> RL
RL uses PPO
There is no separate reasoning step; the model returns the answer directly
The verifier appears to be rule-based
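
For reference, the standard PPO clipped surrogate objective is written out below. This is the generic textbook form; these notes do not record the paper's exact reward shaping, advantage estimator, or clipping value, so reading the reward as coming from the verifier outcome is an assumption.

```latex
% Standard PPO clipped surrogate (generic form, not paper-specific).
% \rho_t: probability ratio between the current and the old policy
% \hat{A}_t: advantage estimate (assumed here to derive from verifier-based rewards)
% \epsilon: clipping range
\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},
\qquad
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[
  \min\left(
    \rho_t(\theta)\,\hat{A}_t,\;
    \operatorname{clip}\left(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t
  \right)
\right]
```

The clipping keeps each update close to the policy that generated the data, which matters here because training starts from the SFT checkpoint.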

result

ood performance

As training progresses, OOD performance is RL > SFT. SFT does not hold its level at all and degrades sharply.

Results on visual OOD

SFT is necessary for RL training when the backbone model does not follow instructions.
Scaling up verification improves generalization.

OOD improvement by number of verification steps: +2.15% (3 steps), +2.99% (5 steps), +5.99% (10 steps). In contrast, with one verification step, we only observe a marginal improvement of +0.48% in OOD performance.
