
paper

TL;DR

  • why I read this : VLM + RLHF
  • task : LVLM
  • problem : hallucination
  • idea : get human annotation at segment level to measure hallucination + learn like rejection sampling / DPO
  • input/output : {image, question} -> per-segment class (accurate, inaccurate, analysis)
  • architecture : InstructBLIP
  • objective : CE loss or proposed FDPO loss
  • baseline : InstructBLIP, LLaVA, mPLUG-OWL
  • data : (proposed) 16K image-prompt-response
  • evaluation : RM score (NLL of true segments), human eval (percent of content that was truthful? Sentence-by-sentence…
  • result : training the reward model and applying rejection sampling improves performance; the proposed FDPO also improves performance
  • contribution : benchmark released; fairly early work on RLHF for VLMs
  • etc. : the M-HalDetect benchmark is well done, so the paper is heavily cited, but the writing doesn’t read smoothly…

Details

The annotation setup (figure from the paper, not reproduced here):

4,000 images -> InstructBLIP responses (10 human annotators?), labeled into 4 classes: accurate, inaccurate, analysis, and unsure

val split: 3,200 of them -> this is probably M-HalDetect

Method

  • Multi-modal reward model: built on InstructBLIP, trained by attaching a classifier (accurate, inaccurate, analysis) at each sentence-level EOS token. For the segment-level reward model, the classifier sits at the end of each segment (a segment simply runs on until the annotated label changes). Not sure why they did it this way..!
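A rough sketch of that segment-level head, just to fix intuition: a linear classifier over the hidden state at each segment's final token. The names, shapes, and random inputs here are all illustrative stand-ins, not the paper's actual code or InstructBLIP's real dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, N_CLASSES = 768, 3  # classes: accurate, inaccurate, analysis (assumed hidden size)

# linear classifier head (would normally be learned with cross-entropy)
W = rng.normal(size=(HIDDEN, N_CLASSES)) * 0.02
b = np.zeros(N_CLASSES)

def segment_logits(hidden_states, segment_end_idx):
    """hidden_states: (seq_len, HIDDEN) decoder states;
    segment_end_idx: index of each segment's final (EOS-like) token.
    Returns (n_segments, N_CLASSES) class logits."""
    seg_states = hidden_states[segment_end_idx]  # gather one state per segment
    return seg_states @ W + b

h = rng.normal(size=(20, HIDDEN))        # fake decoder hidden states
logits = segment_logits(h, [4, 11, 19])  # a response with three segments
print(logits.shape)  # (3, 3): one 3-way prediction per segment
```

The point is just that supervision lands at segment boundaries rather than on the whole response.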

  • Rejection sampling: the paper doesn’t explain this properly, but it appears to sample multiple responses at inference time, use the RM’s negative log-likelihood at each sentence level to judge whether it is a hallucination, and then select best-of-n / worst-of-n, where n is 16 or 64
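My reading of the selection step, as a minimal sketch: score each sentence by the RM's probability of "accurate", sum the log-probabilities over the response, and keep the highest-scoring candidate. The toy reward model and function names below are my own illustration, not the paper's.

```python
import math

def score_response(sentences, reward_model):
    """Summed log P(accurate | sentence), i.e. negative of the summed NLL."""
    return sum(math.log(reward_model(s)) for s in sentences)

def best_of_n(candidates, reward_model):
    # candidates: list of responses, each a list of sentences;
    # worst-of-n would use min() instead
    return max(candidates, key=lambda r: score_response(r, reward_model))

# toy reward model: (arbitrarily) treats longer sentences as less accurate
toy_rm = lambda s: 1.0 / (1.0 + 0.1 * len(s))
cands = [["a short caption"], ["a much longer, more speculative caption"]]
print(best_of_n(cands, toy_rm))  # picks the response with the higher summed score
```

With n = 64 this is just 64 samples fed through the same scoring loop.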

  • Fine-grained direct preference optimization: unlike DPO, there is no preference pair here, so the loss is imposed directly at the segment level

The FDPO loss (shown as a figure in the paper) uses:
  • $x$ : tokens before the current segment
  • $y$ : generated segment
  • $c$ : class of current segment
    • 1 : preferred class (correct)
    • 0 : dispreferred class (incorrect, optionally also analysis)
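Given those symbols, the loss should look roughly like a DPO term without the pairwise contrast: preferred segments (c = 1) push the policy above the reference, dispreferred segments (c = 0) push it below. This is my reconstruction, so treat the exact form as an assumption:

```latex
% Hedged sketch of the FDPO objective; \beta and \sigma as in standard DPO,
% \pi_\theta the policy and \pi_{\mathrm{ref}} the frozen reference model.
\mathcal{L}_{\mathrm{FDPO}}
  = -\,\mathbb{E}_{(x,\,y,\,c)\sim\mathcal{D}}\Big[
      c\,\log \sigma\!\Big(\beta \log \tfrac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\Big)
    + (1 - c)\,\log \sigma\!\Big(-\beta \log \tfrac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\Big)
  \Big]
```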

Result

  • Performance of the reward models (table in the paper, not reproduced here)

  • Rejection sampling / fine-grained DPO results (table in the paper, not reproduced here)

The RM score alone doesn’t cut it… but performance improves on human eval. No other hallucination or general VLM benchmarks were run.