
paper, data, code

TL;DR

  • why I read this : VLM + RLHF
  • task : MLLM
  • Problem : hallucination in MLLMs; even GPT-4V hallucinated in 45.9% of responses.
  • Idea : train with DPO, but have humans pinpoint exactly which segments are wrong and correct them.
  • input/output : {image, question} -> answer
  • architecture : the authors’ previous work, Muffin, a model based on BEiT-3 + 13B Vicuna v1.0
  • objective : slightly modified DPO that reweights the log-prob terms of the DPO loss
  • baseline : QwenVL-Chat, LLaVA, LLaVA1.5, Muffin, InstructBLIP, LLaVA-RLHF
  • data : 1.4K human-annotated correction samples
  • evaluation : Object HalBench, MMHAL-Bench, MHumanEval, LLaVA Bench, VQAv2
  • result : SOTA among open models on hallucination (even beating GPT-4V on some metrics). On LLaVA Bench, LLaVA-RLHF scores slightly higher, but it is essentially a tie.
  • contribution : data-efficient DPO training; public release of the correction data

Details

overall

(figure: overall framework)

underlying challenges in human preference data

  1. ambiguity : there are two candidate answers, each with its own advantages and disadvantages, so it is unclear which one annotators should prefer.
  2. learning efficiency : a single scalar preference must serve as feedback for an entire long answer, so learning is data-hungry; this credit misallocation also leads to problems such as reward hacking.

fine-grained correctional human preference collection

Human annotation at the segment level: annotators correct the hallucinated segments of a model answer. The answers before and after correction become $y_l$ and $y_w$, respectively. The prompts come from instruction data sources, with image-description prompts generated by GPT4 (?), and the initial answers produced by Muffin (??).
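The correction step can be sketched as follows (a minimal sketch: the span-based `corrections` representation and the function name are my own illustration, not the paper's data schema). The uncorrected answer serves as $y_l$, the corrected one as $y_w$, and a token mask records which segments were changed, which DDPO uses later:

```python
def build_preference_pair(answer_tokens, corrections):
    """Turn segment-level human corrections into a DPO preference pair.

    `corrections` is a list of ((start, end), replacement_tokens) spans
    over the original answer -- a hypothetical format chosen for
    illustration. The uncorrected answer is y_l, the corrected answer
    is y_w, and `mask` marks the rewritten tokens of y_w (the corrected
    segments y_c used for DDPO weighting).
    """
    y_l = list(answer_tokens)          # original (hallucinated) answer
    y_w, mask = [], []
    i, ci = 0, 0
    corrections = sorted(corrections)  # process spans left to right
    while i < len(answer_tokens):
        if ci < len(corrections) and corrections[ci][0][0] == i:
            (start, end), replacement = corrections[ci]
            y_w.extend(replacement)
            mask.extend([True] * len(replacement))
            i, ci = end, ci + 1
        else:
            y_w.append(answer_tokens[i])
            mask.append(False)
            i += 1
    return y_w, y_l, mask
```

For example, correcting the span "red" to "blue" in "a red car" yields y_w = "a blue car" with only the replaced token masked as corrected.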

The resulting data averages 64.4 words per response with 2.65 corrected segments. The hallucination types were objects (41.2%), positions (20.3%), numbers (16.5%), attributes (10.0%), actions (5.3%), and misc.

Dense Direct Preference Optimization

  • DPO loss recap:

$$\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\Big[\log \sigma\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\Big)\Big]$$

($\beta = 0.5$ here)
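The per-pair loss can be written in a few lines of plain Python (a minimal sketch; the function name is mine, and the summed sequence log-probs are assumed to be computed elsewhere by the policy and the frozen reference model):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.5):
    """Standard DPO loss for one preference pair.

    logp_w / logp_l         : sequence log-probs of the preferred /
                              rejected answer under the trained policy.
    ref_logp_w / ref_logp_l : the same quantities under the frozen
                              reference model.
    beta=0.5 matches the value quoted above.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin): small when the policy prefers y_w more
    # strongly than the reference does, large when it prefers y_l.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the two implicit rewards tie, the loss is exactly $\log 2$; it shrinks as the policy raises the preferred answer's probability relative to the reference.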

Here, the proposed DDPO weights each token's log-prob according to whether it belongs to a corrected segment ($y_c$) or an unchanged one ($y_u$).

$$\log \pi(y \mid x) = \frac{1}{N}\Big[\sum_{y_i \in y_u} \log p(y_i \mid x, y_{<i}) + \gamma \sum_{y_i \in y_c} \log p(y_i \mid x, y_{<i})\Big]$$

  • $\gamma = 5$
  • $N = \text{len}(y_u) + \gamma \cdot \text{len}(y_c)$
  • the $1/N$ normalization removes the bias toward longer responses, which would otherwise accumulate more log-prob terms
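The weighted score can be sketched in plain Python (the function name and the per-token log-prob / boolean-mask inputs are my assumptions for illustration):

```python
def ddpo_score(token_logps, corrected_mask, gamma=5.0):
    """DDPO's weighted sequence log-prob (sketch).

    Tokens inside corrected segments (y_c, mask=True) are up-weighted
    by gamma; unchanged tokens (y_u) keep weight 1. Dividing by
    N = len(y_u) + gamma * len(y_c) removes the advantage longer
    responses would otherwise get from summing more log-prob terms.
    """
    n_c = sum(corrected_mask)
    n_u = len(token_logps) - n_c
    n = n_u + gamma * n_c
    weighted = sum(
        (gamma if corrected else 1.0) * lp
        for lp, corrected in zip(token_logps, corrected_mask)
    )
    return weighted / n
```

This score replaces the plain sequence log-prob inside the DPO loss for both $y_w$ and $y_l$; with no corrected tokens it reduces to the ordinary per-token mean.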

Result

(figures: result tables on Object HalBench, MMHal-Bench, MHumanEval, LLaVA Bench, and VQAv2)

Ablations

(figures: ablation results)