[179] Aligning Large Multimodal Models with Factually Augmented RLHF

paper , code

TL;DR

I read this because.. : An early VLM + RL work, and it uses PPO.
task : VLM + RL
problem : hallucination in VLMs
idea : Apply PPO! The one difference is that the reward model is additionally given human annotations (captions, etc.)
input/output : {image, question} -> answer
architecture : LLaVA 7B (vicuna)
objective : PPO loss
baseline : OpenFlamingo, MiniGPT-4, InstructBLIP, LLaVA-SFT
data : Generate 10K samples with the LLaVA SFT model, then build human-annotated preference data from them
evaluation : MMBench, LLaVA-w, POPE, MMHal (proposed)
result : improvement on MMBench (fine-grained perception)
contribution : one of the first studies to apply RLHF to a VLM
etc. :
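
For reference, the "PPO loss" in the objective line usually means the following in an RLHF setting. This is the standard textbook form, not something copied from the paper; the symbols $r_\phi$ (reward model), $\pi_{\mathrm{SFT}}$ (the SFT policy), $\beta$ (KL weight), $\epsilon$ (clip range) and $\hat{A}_t$ (advantage estimate) are notation assumed here, and these notes do not record which variant or hyperparameters the authors used.

```latex
% KL-regularized reward that the policy is trained to maximize
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\Big[ r_\phi(x, y)
      - \beta \,\mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{SFT}}(\cdot \mid x)\big) \Big]

% PPO clipped surrogate used to optimize it
L^{\mathrm{CLIP}}(\theta) =
\mathbb{E}_t \Big[ \min\big( \rho_t(\theta)\,\hat{A}_t,\;
                             \mathrm{clip}(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t \big) \Big],
\qquad
\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

Here $x$ is the {image, question} input and $y$ is the generated answer.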

Details

Proposed

human preference data collection

Generated 10K LLaVA held-out samples from the SFT model at temperature 0.7 (image source?).

Instruction shown to annotators when collecting human preference annotations
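
Preference pairs like these are typically used to train a reward model with a pairwise (Bradley-Terry style) loss. The sketch below shows that loss for a single pair. It is an assumption about the usual setup, since these notes do not record the exact loss used in the paper, and the function name `pairwise_preference_loss` is made up for illustration.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)) for one pair.

    The loss is small when the reward model scores the human-preferred
    response higher than the rejected one, and large otherwise.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scalar rewards for two responses to the same {image, question}.
loss_correct_order = pairwise_preference_loss(2.0, 0.0)
loss_wrong_order = pairwise_preference_loss(0.0, 2.0)
```

With equal rewards the loss is log 2, and it shrinks as the preferred response is scored further above the rejected one.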

Prompt given to the RM. Because captions and similar ground-truth information are additionally provided to it, the method is called Factually Augmented RLHF.
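
A minimal sketch of what "factually augmenting" the reward model input could look like on the text side. The template wording, the section labels, and the function name `build_rm_prompt` are all hypothetical, not the paper's actual prompt, and the image input that the RM also receives is omitted here.

```python
from typing import Optional, Sequence


def build_rm_prompt(
    question: str,
    response: str,
    captions: Optional[Sequence[str]] = None,
) -> str:
    """Assemble the text part of a reward-model prompt (hypothetical template)."""
    parts = []
    if captions:
        # Factual augmentation: human-written captions give the reward model
        # ground-truth context to check the response against.
        caption_lines = "\n".join(f"- {caption}" for caption in captions)
        parts.append("Captions:\n" + caption_lines)
    parts.append(f"Question: {question}")
    parts.append(f"Response: {response}")
    parts.append("Rate how helpful and factually grounded the response is.")
    return "\n\n".join(parts)


augmented_prompt = build_rm_prompt(
    "What color is the bus?",
    "The bus is red.",
    captions=["A red bus parked on a city street."],
)
plain_prompt = build_rm_prompt("What color is the bus?", "The bus is red.")
```

Comparing `augmented_prompt` with `plain_prompt` shows the only difference between the two reward-model setups in this sketch: the extra caption block.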

MMHal-Bench

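There are 96 items in total: 12 question-answer pairs for each of 8 categories (object attribute, adversarial object, comparison, counting, spatial relation, environment, holistic, others). The image source is OpenImages. The text-only GPT-4 judge is also given a human-generated answer about the image content and the categories (probably object categories) present in the image. GPT-4's judgments agree with human judgments 94% of the time.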
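
The benchmark layout and the judge-versus-human agreement figure can be sketched as follows. The category names come from the notes above; the judge context format, the function names, and the toy labels are assumptions for illustration only and do not reproduce the paper's actual evaluation prompt or data.

```python
from typing import Sequence

CATEGORIES = [
    "object attribute", "adversarial object", "comparison", "counting",
    "spatial relation", "environment", "holistic", "others",
]
QUESTIONS_PER_CATEGORY = 12


def total_questions() -> int:
    """Number of benchmark items: categories times questions per category."""
    return len(CATEGORIES) * QUESTIONS_PER_CATEGORY


def build_judge_context(question: str, human_answer: str,
                        image_categories: Sequence[str], model_response: str) -> str:
    """Text-only context for the judge (hypothetical layout).

    The judge never sees the image, so the human answer and the category
    list stand in for the image content.
    """
    return (
        f"Image contents (categories): {', '.join(image_categories)}\n"
        f"Question: {question}\n"
        f"Human answer: {human_answer}\n"
        f"Model response: {model_response}"
    )


def agreement_rate(judge_labels: Sequence[bool], human_labels: Sequence[bool]) -> float:
    """Fraction of items where the judge and the human give the same label."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Toy labels (True = "hallucinated"); not real benchmark data.
toy_agreement = agreement_rate([True, False, True, True], [True, False, False, True])
```

`total_questions()` reproduces the 8 x 12 = 96 count, and `agreement_rate` is the kind of statistic behind the 94% figure.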

Result

LLaVA-Bench
MMHal-Bench
MMBench

Qualitative result

Ablation

SFT data ablation

VQA data helps improve POPE.
