TL;DR
- I read this because : early work on RL for VLMs; uses PPO.
- task : VLM + RL
- Problem : hallucination in VLMs
- idea : Let’s apply PPO! The one difference is augmenting the reward model's input with human annotations (captions, etc.).
- input/output : {image, question} -> answer
- architecture : LLaVA 7B (vicuna)
- objective : PPO loss
- baseline : OpenFlamingo, MiniGPT-4, InstructBLIP, LLaVA-SFT
- data : generate 10K samples with the LLaVA-SFT model, then collect human-annotated preference data on them
- evaluation : MMBench, LLaVA-w, POPE, MMHal (proposed)
- result : improved MMBench (fine-grained perception)
- contribution : among the first studies to apply RLHF to VLMs
- etc. :
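The TL;DR lists the PPO loss as the objective. A minimal sketch of the standard PPO clipped surrogate objective (per token/action); the function name and `eps` default are illustrative, not taken from the paper:

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantage, eps=0.2):
    """Clipped PPO surrogate loss for one action (to be minimized).

    ratio = pi_new(a|s) / pi_old(a|s); clipping keeps the policy update
    from moving too far from the behavior policy in one step.
    """
    ratio = math.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + eps), 1 - eps) * advantage
    # Take the pessimistic (smaller) surrogate, negate to get a loss.
    return -min(unclipped, clipped)
```

In RLHF the advantage comes from the reward model's score (minus a KL penalty to the SFT policy); this sketch only shows the clipping mechanics.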
Details
Proposed
- human preference data collection
Generate responses on 10K held-out LLaVA prompts with the SFT model at temperature 0.7 (image source unclear)
Human preference annotators are given written instructions for judging responses.
Factually Augmented RLHF: additional factual information (ground-truth captions, etc.) is added to the reward model's prompt so the RM can penalize answers that contradict the image content.
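A minimal sketch of what "factually augmenting" the reward model's input could look like: the ground-truth caption is prepended to the RM prompt so the scorer has the image facts available. The template and function name are hypothetical, not the paper's actual format:

```python
def build_rm_input(image_caption, question, answer):
    """Assemble a reward-model prompt augmented with ground-truth facts.

    The caption gives a text-only reward model access to what the image
    actually contains, so hallucinated answers can be scored down.
    """
    return (
        f"Image facts: {image_caption}\n"
        f"Question: {question}\n"
        f"Answer: {answer}"
    )

# Example: the RM prompt now contains the factual caption alongside the answer.
prompt = build_rm_input("a dog lying on grass", "What animal is shown?", "A cat.")
```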
MMHal-Bench
96 questions total: 12 for each of 8 categories (object attribute, adversarial object, comparison, counting, spatial relation, environment, holistic, others). Images come from OpenImages. Text-only GPT-4 is given a human-written description of the image content, along with the object categories present in the image, and judges the model's answer. GPT-4's evaluations show 94% agreement with human judgments.
Result
LLaVA-Bench
MMHal-Bench
MMBench
Qualitative result
Ablation
- SFT data ablation
VQA data helps improve POPE