
paper

TL;DR

  • why I read this : VLM + RLHF
  • task : LVLM
  • problem : hallucination
  • idea : get human annotation at segment level to measure hallucination + learn like rejection sampling / DPO
  • input/output : {image, question} -> per-segment class (accurate, inaccurate, analysis)
  • architecture : InstructBLIP
  • objective : CE loss or proposed FDPO loss
  • baseline : InstructBLIP, LLaVA, mPLUG-OWL
  • data : (proposed) 16K image-prompt-response
  • evaluation : RM score (NLL of true segments), human eval (percent of content that was truthful? Sentence-by-sentence…
  • result : training the reward model and applying rejection sampling improves performance; the proposed FDPO also improves performance
  • contribution : benchmark released; fairly early work on RLHF for VLMs
  • etc. : the M-HalDetect benchmark is well done, so the paper is heavily cited, but the writing doesn’t read smoothly…

Details

The annotation setup (figure from the paper, not reproduced here):

4,000 images -> InstructBLIP responses (10 human annotators?), labeled into 4 classes: accurate, inaccurate, analysis, and unsure

val split: 3,200 of them -> this is probably M-HalDetect

Method

  • Multi-modal reward model: built on InstructBLIP, trained by attaching a classifier (accurate, inaccurate, analysis) at each sentence-level EOS token. For the segment-level reward model, the classifier sits at the end of each segment (a segment simply runs on until the annotated label changes). Not sure why they did it this way..!
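A rough sketch of that segment-level head, just to fix intuition: a linear classifier over the hidden state at each segment's final token. The names, shapes, and random inputs here are all illustrative stand-ins, not the paper's actual code or InstructBLIP's real dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, N_CLASSES = 768, 3  # classes: accurate, inaccurate, analysis (assumed hidden size)

# linear classifier head (would normally be learned with cross-entropy)
W = rng.normal(size=(HIDDEN, N_CLASSES)) * 0.02
b = np.zeros(N_CLASSES)

def segment_logits(hidden_states, segment_end_idx):
    """hidden_states: (seq_len, HIDDEN) decoder states;
    segment_end_idx: index of each segment's final (EOS-like) token.
    Returns (n_segments, N_CLASSES) class logits."""
    seg_states = hidden_states[segment_end_idx]  # gather one state per segment
    return seg_states @ W + b

h = rng.normal(size=(20, HIDDEN))        # fake decoder hidden states
logits = segment_logits(h, [4, 11, 19])  # a response with three segments
print(logits.shape)  # (3, 3): one 3-way prediction per segment
```

The point is just that supervision lands at segment boundaries rather than on the whole response.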

  • Rejection sampling: the paper doesn’t explain this properly, but it appears to sample multiple responses at inference time, use the RM’s negative log-likelihood at each sentence level to judge whether it is a hallucination, and then select best-of-n / worst-of-n, where n is 16 or 64
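My reading of the selection step, as a minimal sketch: score each sentence by the RM's probability of "accurate", sum the log-probabilities over the response, and keep the highest-scoring candidate. The toy reward model and function names below are my own illustration, not the paper's.

```python
import math

def score_response(sentences, reward_model):
    """Summed log P(accurate | sentence), i.e. negative of the summed NLL."""
    return sum(math.log(reward_model(s)) for s in sentences)

def best_of_n(candidates, reward_model):
    # candidates: list of responses, each a list of sentences;
    # worst-of-n would use min() instead
    return max(candidates, key=lambda r: score_response(r, reward_model))

# toy reward model: (arbitrarily) treats longer sentences as less accurate
toy_rm = lambda s: 1.0 / (1.0 + 0.1 * len(s))
cands = [["a short caption"], ["a much longer, more speculative caption"]]
print(best_of_n(cands, toy_rm))  # picks the response with the higher summed score
```

With n = 64 this is just 64 samples fed through the same scoring loop.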

  • Fine-grained direct preference optimization: unlike DPO, there is no preference pair here, so the loss is imposed directly at the segment level

The FDPO loss (shown as a figure in the paper) uses:
  • $x$ : tokens before the current segment
  • $y$ : generated segment
  • $c$ : class of current segment
    • 1 : preferred class (correct)
    • 0 : dispreferred class (incorrect, optionally also analysis)
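Given those symbols, the loss should look roughly like a DPO term without the pairwise contrast: preferred segments (c = 1) push the policy above the reference, dispreferred segments (c = 0) push it below. This is my reconstruction, so treat the exact form as an assumption:

```latex
% Hedged sketch of the FDPO objective; \beta and \sigma as in standard DPO,
% \pi_\theta the policy and \pi_{\mathrm{ref}} the frozen reference model.
\mathcal{L}_{\mathrm{FDPO}}
  = -\,\mathbb{E}_{(x,\,y,\,c)\sim\mathcal{D}}\Big[
      c\,\log \sigma\!\Big(\beta \log \tfrac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\Big)
    + (1 - c)\,\log \sigma\!\Big(-\beta \log \tfrac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\Big)
  \Big]
```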

Result

  • Performance of the reward models (table in the paper, not reproduced here)

  • Rejection sampling / fine-grained DPO results (table in the paper, not reproduced here)

The RM score alone doesn’t cut it… but performance improves on human eval. No other hallucination or general VLM benchmarks were run.