TL;DR
- I read this because : VLM + RLHF
- task : LVLM
- problem : hallucination
- idea : collect segment-level human annotations to measure hallucination, then train with rejection sampling / a DPO-style objective
- input/output : {image, question} -> class (accurate, inaccurate, analysis)
- architecture : InstructBLIP
- objective : CE loss or proposed FDPO loss
- baseline : InstructBLIP, LLaVA, mPLUG-OWL
- data : (proposed) 16K image-prompt-response
- evaluation : RM Score (NLL of the true segment labels), human eval (fraction of content judged truthful, rated sentence by sentence)
- result : training the reward model and using it for rejection sampling improved performance; the proposed FDPO also improved performance
- contribution : Benchmarks published, pretty early work on RLHF for VLM
- etc. : the MHALDetect benchmark is well constructed, so the paper is heavily cited, but the paper itself doesn't read very smoothly…
Details
Annotation setup (shown in a figure in the paper):
- 4000 images -> InstructBLIP responses (10 human annotated), labeled at segment level into 4 classes: accurate, inaccurate, analysis, and unsure
- 3200 of them form the val split -> this is probably the MHALDetect benchmark
Method
Multi-Modal Reward Model. Built on InstructBLIP, with a classifier (accurate, inaccurate, analysis) attached at each sentence-level EOS token. For the segment-level reward model, the classifier sits at the end of each segment (a segment just runs on until the label changes in the data). Not sure why they did it this way..!
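A minimal sketch of that classifier head as I read it: pick the hidden state at each segment's end token and run a 3-way linear classifier over it. Everything here (names, shapes, the toy random weights) is hypothetical, not the paper's actual implementation.

```python
import math
import random

# Class names follow the note: accurate / inaccurate / analysis
CLASSES = ["accurate", "inaccurate", "analysis"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classify_segments(hidden_states, segment_end_positions, W, b):
    """hidden_states: list of d-dim token vectors from the LVLM;
    W: 3 x d weight matrix, b: length-3 bias.
    Returns one class-probability vector per segment."""
    out = []
    for pos in segment_end_positions:
        h = hidden_states[pos]
        logits = [sum(w_i * h_i for w_i, h_i in zip(row, h)) + b_k
                  for row, b_k in zip(W, b)]
        out.append(softmax(logits))
    return out

# toy usage: 12 tokens, hidden dim 4, segments ending at tokens 4 and 11
random.seed(0)
hidden = [[random.gauss(0, 1) for _ in range(4)] for _ in range(12)]
W = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]
b = [0.0, 0.0, 0.0]
for p in classify_segments(hidden, [4, 11], W, b):
    print(CLASSES[max(range(3), key=lambda k: p[k])])
```

At training time the head would be fit with a cross-entropy loss against the human segment labels; only the segment-end positions contribute.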
Rejection sampling. The paper doesn't explain this in much detail, but it seems to sample n responses at inference time, then use the reward model's negative log likelihood at each sentence level to judge whether a sentence is a hallucination. They report best-of-n and worst-of-n, with n = 16, 64.
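A hedged sketch of best-of-n as described above: score each candidate response by summing the reward model's per-sentence NLL of the "accurate" class and keep the lowest-NLL one. `rm` outputs here are stand-in probabilities, not the actual reward model.

```python
import math

def response_nll(sentence_accurate_probs):
    """Sum of per-sentence NLLs for the 'accurate' class
    (lower = the RM thinks the response is more truthful)."""
    return sum(-math.log(p) for p in sentence_accurate_probs)

def best_of_n(candidates):
    """candidates: list of (response_text, per-sentence accurate-probs)."""
    return min(candidates, key=lambda c: response_nll(c[1]))[0]

# toy usage: resp B has one sentence the RM flags as likely hallucinated
cands = [
    ("resp A", [0.9, 0.8]),
    ("resp B", [0.9, 0.2]),
]
print(best_of_n(cands))  # -> resp A
```

Worst-of-n (used in the paper as a contrastive baseline, as I understand it) would simply flip `min` to `max`.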
Fine-grained direct preference optimization (FDPO). Unlike DPO, there is no preference pair here; the loss is instead imposed at the segment level:
- $x$ : tokens before the current segment
- $y$ : generated segment
- $c$ : class of current segment
- 1 : preferred class (correct)
- 0 : dispreferred class (incorrect, optionally also analysis)
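My reconstruction of the per-segment loss, using the symbols above: $-\log \sigma\!\big(s(c)\,\beta \log \tfrac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}\big)$, where $s(c)=+1$ for preferred segments and $-1$ for dispreferred ones. A minimal sketch under that assumption (`beta` and the log-prob arguments are hypothetical placeholders):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fdpo_segment_loss(logp_theta, logp_ref, preferred, beta=0.1):
    """DPO-style logit beta * log(pi_theta(y|x) / pi_ref(y|x)),
    pushed up for preferred (correct) segments, down for
    dispreferred (incorrect / analysis) ones. No response pair needed."""
    logit = beta * (logp_theta - logp_ref)
    sign = 1.0 if preferred else -1.0
    return -math.log(sigmoid(sign * logit))

# toy check: a correct segment that the policy already likes more than
# the reference gets a smaller loss than one it likes less
print(fdpo_segment_loss(-1.0, -2.0, True))
print(fdpo_segment_loss(-2.0, -1.0, True))
```

The total loss would average this over all labeled segments in the batch; the frozen reference model keeps the policy from drifting, as in standard DPO.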
Result
Performance of reward models
Rejection sampling / fine-grained DPO results
RM Score alone isn't convincing…. Human Eval shows improved performance, but no other hallucination benchmarks or general VLM benchmarks were evaluated.