TL;DR
- I read this because: recommended by Google Scholar
- task : VLM + RLHF
- problem : want to reduce VLM hallucinations — but can the preference data for DPO training be created on the cheap?
- idea : use CLIP scores to build the preference data
- input/output : {image, text} -> CLIP score
- architecture : MobileVLM-v2, LLaVA-1.5
- objective : DPO loss
- baseline : BLIP-2, InstructBLIP, Shikra, OpenFlamingo, Qwen-VL … ShareGPT4V; HA-DPO as the DPO-based technique
- data : images from the SFT datasets; responses generated with MobileVLM-v2, filtered by CLIPScore and heuristics. Win / lose pairs require a CLIPScore difference of 2 or more.
- evaluation : AMBER; CLIP-based classification (have the model generate captions, then do zero-shot classification with SigLIP); VLM benchmarks (GQA, SQA, VQA, MME, MMB)
- result : improves AMBER. Doesn’t beat Qwen-VL or GPT-4V, but SOTA on AMBER among comparable models. The other benchmarks don’t degrade, though SQA and MMB don’t improve (?)
- contribution : Create DPO data on the cheap.
- etc. :
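The DPO objective listed above can be sketched as a minimal pure-Python function (a sketch, not the paper's code; `beta` and the log-prob inputs are assumed names):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (win, lose) pair.
    Inputs are total sequence log-probs of the winning and losing
    responses under the trained policy and the frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))  # = -log(sigmoid(margin))
```

When policy and reference agree exactly, the loss is log 2; it shrinks as the policy puts more relative mass on the winning response.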
Details
- why CLIP?
Create hallucinated captions and compare which model assigns the larger logit, CLIP vs LLaVA-1.5
bar chart = how often the hallucinated caption gets the larger logit (dark blue: LLaVA-1.5 / light blue: CLIP)
CLIP catches hallucinated objects, attributes, and relations better than the VLM!
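The CLIPScore used for ranking can be sketched in the usual form: rescaled cosine similarity clipped at zero. The w = 2.5 rescaling follows the original CLIPScore formulation; some implementations use 100× the cosine instead, so the scale here is an assumption:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def clip_score(image_emb, text_emb, w=2.5):
    # CLIPScore: rescaled cosine similarity of the two embeddings,
    # clipped at zero so unrelated pairs score 0
    return w * max(cosine(image_emb, text_emb), 0.0)
```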
CLIP-DPO: no change to the DPO algorithm, only the data pool
- generation: created in two forms using lightweight VLM (MobileVLM-v2 family in the paper)
generic captions: ask the MobileVLM-v2 models to generate a caption, using 5 prompts
per-image QA
Ask Mistral 7B to create questions and correct and incorrect answers from images
- data annotation
CLIP ranking: CLIPScore in a nutshell
Global filtering :
Remove images containing text, since they get inflated CLIPScores
Remove below CLIPScore threshold
Remove long captions
Remove questions with low CLIPScore (generic ones like “what is the main object in the image?”)
Pair filtering :
For QA, strip the question’s reference to the image with a regex, concatenate with the answer, and filter out low-CLIPScore ones (?)
Only those with a CLIPScore difference of 2 or more
As long as the caption lengths are not too different
We end up with 750K pairs – 50K of which are QA and 700K of which are captions
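The pair-filtering step above can be sketched as follows; `min_gap=2` comes from the note, while the length-ratio cap is a hypothetical stand-in for the paper's caption-length heuristic:

```python
def build_dpo_pairs(scored, min_gap=2.0, max_len_ratio=1.5):
    """scored: list of (caption, clip_score) tuples for one image.
    Returns (win, lose) caption pairs with a CLIPScore gap of at
    least min_gap and roughly comparable lengths."""
    pairs = []
    for cap_w, s_w in scored:
        for cap_l, s_l in scored:
            if s_w - s_l < min_gap:
                continue  # gap too small to be a reliable preference
            len_w, len_l = len(cap_w.split()), len(cap_l.split())
            # length heuristic: drop pairs whose word counts differ too much
            if max(len_w, len_l) <= max_len_ratio * min(len_w, len_l):
                pairs.append((cap_w, cap_l))
    return pairs
```

The gap threshold keeps only pairs where CLIP clearly prefers one caption, and the length check avoids pairs where DPO could just learn a length bias.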