
paper

TL;DR

  • why I read this : recommended by Google Scholar
  • task : VLM + RLHF
  • problem : I want to reduce VLM hallucination, but can the data for DPO training be created cheaply?
  • idea : use CLIP score to rank model outputs into the preference data
  • input/output : {image, text} -> CLIP score
  • architecture : MobileVLM-v2 (family), LLaVA-1.5
  • objective : DPO loss
  • baseline : BLIP-2, InstructBLIP, Shikra, OpenFlamingo, Qwen-VL … ShareGPT4V; HA-DPO as the DPO-based comparison
  • data : images come from SFT datasets; text is generated with MobileVLM-v2, then filtered by CLIP score and heuristics. Win/lose pairs are formed when the CLIP scores differ by at least 2.
  • evaluation : AMBER; CLIP-based classification (tell the model to generate a caption, then do zero-shot classification with SigLIP); VLM benchmarks (GQA, SQA, VQA, MME, MMB)
  • result : AMBER improves; SoTA there, though still below Qwen-VL and GPT-4V. The other benchmarks don't degrade, but SQA and MMB don't improve either?
  • contribution : Create DPO data on the cheap.
  • etc. :

Details

  • why CLIP? Create hallucinated captions as shown below, then compare the logits that CLIP and LLaVA-1.5 assign to them.

[figure: bars = cases where the hallucinated caption is assigned the larger logit (dark blue: LLaVA-1.5, light blue: CLIP)]

CLIP catches hallucinated objects, attributes, and relations better than the VLM!

  • CLIP-DPO : no change to the DPO algorithm itself, only to the data pool
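Since the objective is unchanged, it is the standard per-pair DPO loss. A minimal numpy sketch (variable names and the example log-probs are mine, not from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (win, lose) pair.

    logp_*     : sequence log-prob of the winner/loser under the policy
    ref_logp_* : the same quantities under the frozen reference model
    beta       : KL-strength hyperparameter
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(sigmoid(margin))

# Loss shrinks as the policy prefers the CLIP-chosen winner more
# strongly than the reference model does (margin > 0), and grows
# when it prefers the loser (margin < 0).
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))  # positive margin, small loss
print(dpo_loss(-12.0, -10.0, -11.0, -11.0))  # negative margin, larger loss
```

The only CLIP-DPO-specific part is where the (win, lose) pairs come from; the training step itself is vanilla DPO.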

  • data

  1. generation: text is created in two forms using a lightweight VLM (the MobileVLM-v2 family in the paper)
  • generic captions : ask the MobileVLM-v2 models to create a caption, using 5 prompts

  • per-image QA

Ask Mistral-7B to create questions with correct and incorrect answers for the images

  2. data annotation
  • CLIP ranking : essentially CLIPScore
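CLIPScore is essentially a scaled, clipped cosine similarity between CLIP's image and text embeddings. A sketch with toy vectors standing in for real encoder outputs (in practice you would encode with a CLIP model; the scale `w=2.5` follows the original CLIPScore formulation, and any monotone scaling gives the same ranking):

```python
import numpy as np

def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore-style ranking signal: scaled, clipped cosine similarity
    between a CLIP image embedding and a CLIP text embedding."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return w * max(float(image_emb @ text_emb), 0.0)

# toy embeddings standing in for CLIP encoder outputs
rng = np.random.default_rng(0)
img = rng.normal(size=512)
good_caption = img + 0.3 * rng.normal(size=512)  # close to the image
bad_caption = rng.normal(size=512)               # unrelated text

assert clip_score(img, good_caption) > clip_score(img, bad_caption)
```

Ranking candidate captions/answers by this score is what produces the win/lose ordering.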

  • Global filtering :

  • Remove images that contain rendered text, because text in the image inflates the CLIPScore

  • Remove samples below a CLIPScore threshold

  • Remove long captions

  • Remove questions with low CLIPScore (e.g. generic ones like “what is the main object in the image?”)

  • Pair filtering :

  • For QA, strip the image-description part of the question with a regex, concatenate the rest with the answer, and filter out pairs with low CLIPScore (exact details unclear to me)

  • Keep only pairs with a CLIPScore difference of at least 2

  • Keep only pairs whose caption lengths are not too different

We end up with 750K pairs, 50K of which are QA and 700K of which are captions
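The two pair filters above can be sketched as follows (the length heuristic and the function/parameter names are my placeholders; only the CLIPScore gap of 2 is from the paper):

```python
def build_dpo_pairs(candidates, min_gap=2.0, max_len_ratio=2.0):
    """candidates: list of (text, clip_score) for one image.
    Returns (win, lose) pairs obeying the two pair filters:
    a CLIPScore gap of at least `min_gap`, and caption lengths
    that are not too different (`max_len_ratio` is a placeholder
    heuristic for the paper's length check)."""
    pairs = []
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    for i, (win, s_win) in enumerate(ranked):
        for lose, s_lose in ranked[i + 1:]:
            if s_win - s_lose < min_gap:
                continue  # CLIPScore gap too small
            lo, hi = sorted([len(win.split()), len(lose.split())])
            if lo == 0 or hi / lo > max_len_ratio:
                continue  # caption lengths too different
            pairs.append((win, lose))
    return pairs

caps = [("a dog on a beach", 31.0),
        ("a dog and a cat on a beach", 28.5),
        ("a cat", 27.0)]
print(build_dpo_pairs(caps))
```

With the toy scores above, only the first two captions form a pair: the third is either too far off in length or within the 2-point gap of its competitor.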


Result

[result tables]