
paper, data, code

TL;DR

  • why I read this : VLM + RLHF
  • task : MLLM
  • Problem : hallucination in MLLMs; even GPT-4V hallucinated in 45.9% of responses.
  • Idea : train with DPO, but have humans pinpoint exactly which segments are wrong and correct them.
  • input/output : {image, question} -> answer
  • architecture : the authors’ previous work, Muffin, a model based on BEiT-3 + 13B Vicuna v1.0
  • objective : slightly modified DPO that reweights the log-prob terms of the DPO loss
  • baseline : QwenVL-Chat, LLaVA, LLaVA1.5, Muffin, InstructBLIP, LLaVA-RLHF
  • data : 1.4K human-annotated correction samples
  • evaluation : Object HalBench, MMHAL-Bench, MHumanEval, LLaVA Bench, VQAv2
  • result : SOTA among open models on hallucination (even beating GPT-4V on some metrics). On LLaVA Bench, LLaVA-RLHF scores slightly higher, but it is essentially a tie.
  • contribution : data-efficient DPO training; public release of the correction data

Details

overall

(figure: overall framework)

underlying challenges in human preference data

  1. ambiguity : there are two candidate answers, each with its own advantages and disadvantages, so it is unclear which one annotators should prefer.
  2. learning efficiency : a single scalar preference must serve as feedback for an entire long answer, so learning is data-hungry; this credit misallocation also leads to problems such as reward hacking.

fine-grained correctional human preference collection

Human annotation at the segment level: annotators correct the hallucinated segments of a model answer. The answers before and after correction become $y_l$ and $y_w$, respectively. The prompts come from instruction data sources, with image-description prompts generated by GPT4 (?), and the initial answers produced by Muffin (??).
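The correction step can be sketched as follows (a minimal sketch: the span-based `corrections` representation and the function name are my own illustration, not the paper's data schema). The uncorrected answer serves as $y_l$, the corrected one as $y_w$, and a token mask records which segments were changed, which DDPO uses later:

```python
def build_preference_pair(answer_tokens, corrections):
    """Turn segment-level human corrections into a DPO preference pair.

    `corrections` is a list of ((start, end), replacement_tokens) spans
    over the original answer -- a hypothetical format chosen for
    illustration. The uncorrected answer is y_l, the corrected answer
    is y_w, and `mask` marks the rewritten tokens of y_w (the corrected
    segments y_c used for DDPO weighting).
    """
    y_l = list(answer_tokens)          # original (hallucinated) answer
    y_w, mask = [], []
    i, ci = 0, 0
    corrections = sorted(corrections)  # process spans left to right
    while i < len(answer_tokens):
        if ci < len(corrections) and corrections[ci][0][0] == i:
            (start, end), replacement = corrections[ci]
            y_w.extend(replacement)
            mask.extend([True] * len(replacement))
            i, ci = end, ci + 1
        else:
            y_w.append(answer_tokens[i])
            mask.append(False)
            i += 1
    return y_w, y_l, mask
```

For example, correcting the span "red" to "blue" in "a red car" yields y_w = "a blue car" with only the replaced token masked as corrected.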

The resulting data averages 64.4 words per response with 2.65 corrected segments. The hallucination types were objects (41.2%), positions (20.3%), numbers (16.5%), attributes (10.0%), actions (5.3%), and misc.

Dense Direct Preference Optimization

  • DPO loss recap:

$$\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\Big[\log \sigma\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\Big)\Big]$$

($\beta = 0.5$ here)
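The per-pair loss can be written in a few lines of plain Python (a minimal sketch; the function name is mine, and the summed sequence log-probs are assumed to be computed elsewhere by the policy and the frozen reference model):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.5):
    """Standard DPO loss for one preference pair.

    logp_w / logp_l         : sequence log-probs of the preferred /
                              rejected answer under the trained policy.
    ref_logp_w / ref_logp_l : the same quantities under the frozen
                              reference model.
    beta=0.5 matches the value quoted above.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin): small when the policy prefers y_w more
    # strongly than the reference does, large when it prefers y_l.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the two implicit rewards tie, the loss is exactly $\log 2$; it shrinks as the policy raises the preferred answer's probability relative to the reference.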

Here, the proposed DDPO weights each token's log-prob according to whether it belongs to a corrected segment ($y_c$) or an unchanged one ($y_u$).

$$\log \pi(y \mid x) = \frac{1}{N}\Big[\sum_{y_i \in y_u} \log p(y_i \mid x, y_{<i}) + \gamma \sum_{y_i \in y_c} \log p(y_i \mid x, y_{<i})\Big]$$

  • $\gamma = 5$
  • $N = \text{len}(y_u) + \gamma \cdot \text{len}(y_c)$
  • the $1/N$ normalization removes the bias toward longer responses, which would otherwise accumulate more log-prob terms
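The weighted score can be sketched in plain Python (the function name and the per-token log-prob / boolean-mask inputs are my assumptions for illustration):

```python
def ddpo_score(token_logps, corrected_mask, gamma=5.0):
    """DDPO's weighted sequence log-prob (sketch).

    Tokens inside corrected segments (y_c, mask=True) are up-weighted
    by gamma; unchanged tokens (y_u) keep weight 1. Dividing by
    N = len(y_u) + gamma * len(y_c) removes the advantage longer
    responses would otherwise get from summing more log-prob terms.
    """
    n_c = sum(corrected_mask)
    n_u = len(token_logps) - n_c
    n = n_u + gamma * n_c
    weighted = sum(
        (gamma if corrected else 1.0) * lp
        for lp, corrected in zip(token_logps, corrected_mask)
    )
    return weighted / n
```

This score replaces the plain sequence log-prob inside the DPO loss for both $y_w$ and $y_l$; with no corrected tokens it reduces to the ordinary per-token mean.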

Result

(figures: result tables on Object HalBench, MMHal-Bench, MHumanEval, LLaVA Bench, and VQAv2)

Ablations

(figures: ablation results)