image

paper, code

TL;DR

  • I read this because.. : early work on RL for VLMs; uses PPO.
  • task : VLM + RL
  • problem : hallucination in VLMs
  • idea : apply PPO to a VLM. The distinguishing point is additionally feeding human annotations (captions, etc.) into the reward model.
  • input/output : {image, question} -> answer
  • architecture : LLaVA 7B (Vicuna)
  • objective : PPO loss
  • baseline : OpenFlamingo, MiniGPT-4, InstructBLIP, LLaVA-SFT
  • data : 10K samples generated with the LLaVA SFT model, then human-annotated preference data collected on them
  • evaluation : MMBench, LLaVA-Bench (in the Wild), POPE, MMHal-Bench (proposed)
  • result : improvement on MMBench (fine-grained perception)
  • contribution : essentially the first work to apply RLHF to a VLM
  • etc. :

Details

Proposed

image
  • Human preference data collection: responses are sampled from the SFT model at temperature 0.7 on 10K held-out LLaVA prompts (the image source is unclear). Instructions given to the human preference annotators: image

RM model์—๊ฒŒ ์ฃผ๋Š” prompt. ์ถ”๊ฐ€์ ์œผ๋กœ caption ๋“ฑ์„ ์คฌ๋‹ค๊ณ  ํ•ด์„œ factually augmented rlhf image

MMHal-Bench

์ˆ˜๋Ÿ‰์€ 96๊ฐœ์ด๊ณ  8๊ฐœ์˜ ์นดํ…Œ๊ณ ๋ฆฌ(object attribute, adversairal object, comparsion, counting, spatial relation, environment, holistic, others)์— ๋Œ€ํ•ด 12๊ฐœ ์งˆ๋‹ต์„ ๋งŒ๋“ฆ. ์ด๋ฏธ์ง€ ์†Œ์Šค๋Š” OpenImages์ด๊ณ  text-only GPT4์—๊ฒŒ ์ด๋ฏธ์ง€ ์ปจํ…์ธ ์— ๋Œ€ํ•œ ์‚ฌ๋žŒ์ด ์ƒ์„ฑํ•œ ๋‹ต๋ณ€๊ณผ ์ด๋ฏธ์ง€ ๋‚ด์— ์žˆ๋Š” (์•„๋งˆ Object์˜) ์นดํ…Œ๊ณ ๋ฆฌ๋„ ๊ฐ™์ด ์คŒ. gpt4์˜ ํ‰๊ฐ€ ๊ฒฐ๊ณผ๋Š” human๊ณผ 94% ์ผ์น˜ํ•จ.

Result

  • LLaVA-Bench image

  • MMHal-Bench image

  • MMBench

image

Qualitative result

image

Ablation

  • SFT data ablation image

VQA ๋ฐ์ดํ„ฐ๊ฐ€ POPE ๊ฐœ์„ ์— ๋„์›€