
paper

TL;DR

  • why I read this : recommended by Google Scholar
  • task : VLM + RLHF
  • problem : I want to reduce VLM hallucination, but can the data for DPO training be created cheaply?
  • idea : use CLIP score to rank model outputs into the preference data
  • input/output : {image, text} -> CLIP score
  • architecture : MobileVLM-v2 (family), LLaVA-1.5
  • objective : DPO loss
  • baseline : BLIP-2, InstructBLIP, Shikra, OpenFlamingo, Qwen-VL … ShareGPT4V; HA-DPO as the DPO-based comparison
  • data : images come from SFT datasets; text is generated with MobileVLM-v2, then filtered by CLIP score and heuristics. Win/lose pairs are formed when the CLIP scores differ by at least 2.
  • evaluation : AMBER; CLIP-based classification (tell the model to generate a caption, then do zero-shot classification with SigLIP); VLM benchmarks (GQA, SQA, VQA, MME, MMB)
  • result : AMBER improves; SoTA there, though still below Qwen-VL and GPT-4V. The other benchmarks don't degrade, but SQA and MMB don't improve either?
  • contribution : Create DPO data on the cheap.
  • etc. :

Details

  • why CLIP? Create hallucinated captions as shown below, then compare the logits that CLIP and LLaVA-1.5 assign to them.

[figure: bars = cases where the hallucinated caption is assigned the larger logit (dark blue: LLaVA-1.5, light blue: CLIP)]

CLIP catches hallucinated objects, attributes, and relations better than the VLM!

  • CLIP-DPO : no change to the DPO algorithm itself, only to the data pool
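Since the objective is unchanged, it is the standard per-pair DPO loss. A minimal numpy sketch (variable names and the example log-probs are mine, not from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (win, lose) pair.

    logp_*     : sequence log-prob of the winner/loser under the policy
    ref_logp_* : the same quantities under the frozen reference model
    beta       : KL-strength hyperparameter
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(sigmoid(margin))

# Loss shrinks as the policy prefers the CLIP-chosen winner more
# strongly than the reference model does (margin > 0), and grows
# when it prefers the loser (margin < 0).
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))  # positive margin, small loss
print(dpo_loss(-12.0, -10.0, -11.0, -11.0))  # negative margin, larger loss
```

The only CLIP-DPO-specific part is where the (win, lose) pairs come from; the training step itself is vanilla DPO.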

  • data

  1. generation: text is created in two forms using a lightweight VLM (the MobileVLM-v2 family in the paper)
  • generic captions : ask the MobileVLM-v2 models to create a caption, using 5 prompts

  • per-image QA

Ask Mistral-7B to create questions with correct and incorrect answers for the images

  2. data annotation
  • CLIP ranking : essentially CLIPScore
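CLIPScore is essentially a scaled, clipped cosine similarity between CLIP's image and text embeddings. A sketch with toy vectors standing in for real encoder outputs (in practice you would encode with a CLIP model; the scale `w=2.5` follows the original CLIPScore formulation, and any monotone scaling gives the same ranking):

```python
import numpy as np

def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore-style ranking signal: scaled, clipped cosine similarity
    between a CLIP image embedding and a CLIP text embedding."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return w * max(float(image_emb @ text_emb), 0.0)

# toy embeddings standing in for CLIP encoder outputs
rng = np.random.default_rng(0)
img = rng.normal(size=512)
good_caption = img + 0.3 * rng.normal(size=512)  # close to the image
bad_caption = rng.normal(size=512)               # unrelated text

assert clip_score(img, good_caption) > clip_score(img, bad_caption)
```

Ranking candidate captions/answers by this score is what produces the win/lose ordering.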

  • Global filtering :

  • Remove images that contain rendered text, because text in the image inflates the CLIPScore

  • Remove samples below a CLIPScore threshold

  • Remove long captions

  • Remove questions with low CLIPScore (e.g. generic ones like “what is the main object in the image?”)

  • Pair filtering :

  • For QA, strip the image-description part of the question with a regex, concatenate the rest with the answer, and filter out pairs with low CLIPScore (exact details unclear to me)

  • Keep only pairs with a CLIPScore difference of at least 2

  • Keep only pairs whose caption lengths are not too different

We end up with 750K pairs, 50K of which are QA and 700K of which are captions
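The two pair filters above can be sketched as follows (the length heuristic and the function/parameter names are my placeholders; only the CLIPScore gap of 2 is from the paper):

```python
def build_dpo_pairs(candidates, min_gap=2.0, max_len_ratio=2.0):
    """candidates: list of (text, clip_score) for one image.
    Returns (win, lose) pairs obeying the two pair filters:
    a CLIPScore gap of at least `min_gap`, and caption lengths
    that are not too different (`max_len_ratio` is a placeholder
    heuristic for the paper's length check)."""
    pairs = []
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    for i, (win, s_win) in enumerate(ranked):
        for lose, s_lose in ranked[i + 1:]:
            if s_win - s_lose < min_gap:
                continue  # CLIPScore gap too small
            lo, hi = sorted([len(win.split()), len(lose.split())])
            if lo == 0 or hi / lo > max_len_ratio:
                continue  # caption lengths too different
            pairs.append((win, lose))
    return pairs

caps = [("a dog on a beach", 31.0),
        ("a dog and a cat on a beach", 28.5),
        ("a cat", 27.0)]
print(build_dpo_pairs(caps))
```

With the toy scores above, only the first two captions form a pair: the third is either too far off in length or within the 2-point gap of its competitor.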


Result

[result tables]