image

paper, code

TL;DR

  • I read this because.. : early work on RL for VLMs, using PPO.
  • task : VLM + RL
  • Problem : hallucination in VLMs
  • idea : Let’s apply PPO! The main difference is that factual annotations (ground-truth captions, etc.) are added to the reward model’s input.
  • input/output : {image, question} -> answer
  • architecture : LLaVA 7B (vicuna)
  • objective : PPO loss
  • baseline : OpenFlamingo, MiniGPT-4, InstructBLIP, LLaVA-SFT
  • data : Generate 10K samples with the LLaVA-SFT model, then collect human-annotated preference data on them
  • evaluation : MMBench, LLaVA-w, POPE, MMHal (proposed)
  • result : Improved MMBench (fine-grained perception)
  • contribution : One of the first studies to apply RLHF to VLMs
  • etc. :
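The PPO objective listed above is the standard clipped surrogate loss. A minimal pure-Python sketch (illustrative only, not the paper's exact implementation; real code operates on tensors per token):

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate loss over a batch.

    logp_new:   log-probs of sampled responses under the current policy
    logp_old:   log-probs under the policy that generated the rollouts
    advantages: advantage estimates derived from the reward model
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                      # importance ratio
        clipped = max(1 - clip_eps, min(1 + clip_eps, ratio))  # clip the ratio
        total += min(ratio * adv, clipped * adv)               # pessimistic bound
    # maximize the surrogate -> return its negation as a loss
    return -total / len(advantages)
```

When the new and old policies agree, the ratio is 1 and the loss reduces to the negative mean advantage; large ratios are clipped so a single rollout cannot push the policy too far.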

Details

Proposed

image
  • human preference data collection : generate responses with the SFT model (temperature 0.7) on 10K held-out LLaVA instructions (image source?). When collecting human preference annotations, annotators see the instruction and image and pick the better response. image
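The collection step above can be sketched as follows (a minimal sketch; `model_generate` and the record fields are assumptions, not the paper's actual pipeline):

```python
def sample_preference_pair(model_generate, instruction, image, temperature=0.7):
    """Sketch of preference-pair collection: sample two responses from the
    SFT model at temperature 0.7 on a held-out instruction; a human
    annotator then labels which of the two is preferred."""
    response_a = model_generate(instruction, image, temperature=temperature)
    response_b = model_generate(instruction, image, temperature=temperature)
    return {"instruction": instruction, "image": image,
            "responses": [response_a, response_b]}  # + human "chosen" label
```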

Factually Augmented RLHF: prompts to the RM additionally include factual information (ground-truth captions, etc.), so the reward model can penalize responses that contradict the image content rather than guessing. image

MMHal-Bench

96 questions are created, 12 per category across 8 categories (object attribute, adversarial object, comparison, counting, spatial relation, environment, holistic, others). Images come from OpenImages. Text-only GPT-4 is given a human-written answer describing the image content, along with the object categories present in the image, and judges the model's response. GPT-4's evaluations show 94% agreement with humans.
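The evaluation setup above can be sketched as follows (the judge prompt template is an assumption; only the category list and the text-only-judge idea come from the notes):

```python
# MMHal-Bench layout: 8 question categories x 12 questions each = 96 total.
CATEGORIES = ["object attribute", "adversarial object", "comparison",
              "counting", "spatial relation", "environment", "holistic", "others"]

def build_judge_prompt(question, reference_answer, model_answer, category):
    """Sketch of the text-only GPT-4 judging input: the judge never sees
    the image, only a human-written reference answer describing the image
    content, plus the question category."""
    return (f"Category: {category}\n"
            f"Question: {question}\n"
            f"Reference (human) answer: {reference_answer}\n"
            f"Model answer: {model_answer}\n"
            "Rate the model answer for hallucination against the reference.")
```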

Result

  • LLaVA bench image

  • mmhal bench image

  • mmbench

image

Qualitative result

image

Ablation

  • SFT data ablation image

VQA data in SFT helps improve POPE scores
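For reference, POPE scores yes/no object-existence answers; a minimal scoring sketch (metric definitions only, not POPE's negative-sampling strategies):

```python
def pope_scores(predictions, labels):
    """Accuracy / precision / recall / F1 for yes-no object-existence
    answers, with "yes" (object present) as the positive class."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == "yes" and l == "yes" for p, l in pairs)
    fp = sum(p == "yes" and l == "no" for p, l in pairs)
    fn = sum(p == "no" and l == "yes" for p, l in pairs)
    tn = sum(p == "no" and l == "no" for p, l in pairs)
    acc = (tp + tn) / len(pairs)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}
```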