TL;DR
- I read this because : early work on RL for VLMs; uses PPO.
- task : VLM + RL
- Problem : hallucination in VLMs
- idea : Let’s apply PPO! The one difference is augmenting the reward model's input with human annotations (captions, etc.).
- input/output : {image, question} -> answer
- architecture : LLaVA 7B (vicuna)
- objective : PPO loss
- baseline : OpenFlamingo, MiniGPT-4, InstructBLIP, LLaVA-SFT
- data : generate 10K samples with the LLaVA-SFT model, then collect human-annotated preference data on them
- evaluation : MMBench, LLaVA-w, POPE, MMHal (proposed)
- result : improved MMBench (fine-grained perception)
- contribution : among the first studies to apply RLHF to VLMs
- etc. :
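The TL;DR lists the PPO loss as the objective. A minimal sketch of the standard PPO clipped surrogate objective (per token/action); the function name and `eps` default are illustrative, not taken from the paper:

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantage, eps=0.2):
    """Clipped PPO surrogate loss for one action (to be minimized).

    ratio = pi_new(a|s) / pi_old(a|s); clipping keeps the policy update
    from moving too far from the behavior policy in one step.
    """
    ratio = math.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + eps), 1 - eps) * advantage
    # Take the pessimistic (smaller) surrogate, negate to get a loss.
    return -min(unclipped, clipped)
```

In RLHF the advantage comes from the reward model's score (minus a KL penalty to the SFT policy); this sketch only shows the clipping mechanics.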
Details
Proposed
- human preference data collection
Generate responses on 10K held-out LLaVA prompts with the SFT model at temperature 0.7 (image source unclear)
Human preference annotators are given written instructions for judging responses.
Factually Augmented RLHF: additional factual information (ground-truth captions, etc.) is added to the reward model's prompt so the RM can penalize answers that contradict the image content.
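A minimal sketch of what "factually augmenting" the reward model's input could look like: the ground-truth caption is prepended to the RM prompt so the scorer has the image facts available. The template and function name are hypothetical, not the paper's actual format:

```python
def build_rm_input(image_caption, question, answer):
    """Assemble a reward-model prompt augmented with ground-truth facts.

    The caption gives a text-only reward model access to what the image
    actually contains, so hallucinated answers can be scored down.
    """
    return (
        f"Image facts: {image_caption}\n"
        f"Question: {question}\n"
        f"Answer: {answer}"
    )

# Example: the RM prompt now contains the factual caption alongside the answer.
prompt = build_rm_input("a dog lying on grass", "What animal is shown?", "A cat.")
```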
MMHal-Bench
96 questions total: 12 for each of 8 categories (object attribute, adversarial object, comparison, counting, spatial relation, environment, holistic, others). Images come from OpenImages. Text-only GPT-4 is given a human-written description of the image content, along with the object categories present in the image, and judges the model's answer. GPT-4's evaluations show 94% agreement with human judgments.
Result
LLaVA-Bench
MMHal-Bench
MMBench
Qualitative result
Ablation
- SFT data ablation
VQA data helps improve POPE