(figure)

paper

TL;DR

  • I read this because : VLM self-rewarding
  • task : LVLM
  • Problem : LVLMs suffer from object hallucination because they pay too much attention to text tokens.
  • idea : combine self-rewarding with image relevance (via CLIPScore) to make the reward image-dependent
  • architecture : LLaVA 1.5 7B / 13B
  • objective : DPO loss
  • baseline : LLaVA, RLHF-V, VLfeedback, …
  • data : generated iteratively by the model itself; the seed set is a randomly drawn 13K subset of the LLaVA-Instruct-150K data
  • evaluation : VLM benchmarks (MME, SEED, LLaVA_w, MMBench, …), VQA (SQA, VizWiz, GQA), hallucination benchmarks (POPE, CHAIR)
  • Result : improvements across VLM benchmarks, VQA, and hallucination benchmarks
  • contribution :
  • etc. :

Details

Preliminary

LARGE LANGUAGE MODELS CAN SELF-IMPROVE https://arxiv.org/abs/2210.11610

Proposed

(figures)

Generate samples with the VLM (beam search decoding), assign a reward to each sentence, and score a whole response as the sum of its sentence rewards. The best and worst responses form chosen/rejected pairs for DPO training; samples are then generated again with the trained VLM, and so on. This loop is repeated three times.
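The loop above can be sketched as follows. This is a minimal illustration, not the paper's code: `split_sentences`, `text_score`, and `image_score` are stand-ins for the actual components, and putting the $\lambda = 0.9$ weight on the text score (rather than the image score) is an assumption.

```python
def split_sentences(response):
    # Naive sentence splitter, for illustration only.
    return [s.strip() for s in response.split(".") if s.strip()]

def response_reward(response, text_score, image_score, lam=0.9):
    # Sequence reward = sum of per-sentence rewards; each sentence reward
    # mixes a text score and an image score (weight placement assumed).
    return sum(
        lam * text_score(s) + (1 - lam) * image_score(s)
        for s in split_sentences(response)
    )

def build_dpo_pair(candidates, text_score, image_score, lam=0.9):
    # Rank beam-search candidates by total reward:
    # best response -> "chosen", worst -> "rejected" for DPO training.
    ranked = sorted(
        candidates,
        key=lambda r: response_reward(r, text_score, image_score, lam),
    )
    return ranked[-1], ranked[0]
```

After DPO training on these pairs, the updated model regenerates candidates from its own outputs, which is what makes the scheme self-rewarding.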

Reward

A weighted sum of the text score and the image score (figure).

$\lambda$ is a hyperparameter, set to 0.9.
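Written out, the per-sentence reward would be the following; note that which term $\lambda = 0.9$ multiplies is an assumption here, taken from the figure:

$$
R(s) = \lambda\, R_t(s \mid x) + (1 - \lambda)\, R_i(s, v)
$$

where $x$ is the prompt, $s$ a sentence of the response, and $v$ the input image.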

  • text score (figure)

$x$ : prompt, $r_i$ : the $i$-th response token, $s$ : a sentence, $R_t$ : the text-decoder part of the LVLM.

Interestingly, the text score conditions only on the current sentence, not on the image and not on previous sentences. The paper calls it an instruction-following score.
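One plausible reading of that text score is a length-normalized log-probability of the sentence tokens given only the prompt (no image features, no earlier sentences). `logprob_fn` is a hypothetical handle to $R_t$, the LVLM's text decoder; whether the paper length-normalizes is not stated in these notes.

```python
def sentence_text_score(logprob_fn, prompt_tokens, sentence_tokens):
    # Score one sentence s given only the prompt x:
    # mean over i of log p(r_i | x, r_<i), where r_i are this sentence's tokens.
    total = 0.0
    context = list(prompt_tokens)
    for tok in sentence_tokens:
        total += logprob_fn(context, tok)  # log p(tok | context)
        context.append(tok)
    return total / max(len(sentence_tokens), 1)
```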

  • image score (figure)

The image score is the CLIPScore between the image and the sentence.
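CLIPScore (Hessel et al., 2021) is a rescaled, clipped cosine similarity between CLIP's image and text embeddings. A sketch over precomputed embeddings (obtaining the embeddings from an actual CLIP model is assumed):

```python
import numpy as np

def clip_score(image_emb, text_emb, w=2.5):
    # CLIPScore: w * max(cos(image, text), 0), with w = 2.5 in the original paper.
    image_emb = np.asarray(image_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    cos = float(image_emb @ text_emb) / (
        np.linalg.norm(image_emb) * np.linalg.norm(text_emb)
    )
    return w * max(cos, 0.0)
```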

Result

(figure)
  • comparison with other VLMs (figure)

  • results across iterations (figure)

(figures)

Ablations

(figure)