TL;DR
- I read this because: self-rewarding for VLMs
- task : LVLM
- Problem : LVLMs suffer from object hallucination because they pay too much attention to the text tokens.
- idea : combine self-rewarding with an image-relevance term (CLIPScore) so the reward depends on the image
- architecture : LLaVA 1.5 7B / 13B
- objective : DPO loss
- baseline : LLaVA, RLHF-V, VLfeedback, …
- data : self-generated, regenerated at each iteration; the seed set is a randomly drawn 13K subset of the LLaVA-Instruct-150K data
- evaluation : VLM bench(MME, SEED, LLaVA_w, MMBench, …), VQA(SQA, VisWiz, GQA), Hall-bench(POPE, CHAIR)
- Result : VLM bench, VQA, hall-bench all improved
- contribution :
- etc. :
Details
Preliminary
LARGE LANGUAGE MODELS CAN SELF-IMPROVE https://arxiv.org/abs/2210.11610
Proposed
Generate samples with the VLM (beam search decoding), assign a reward to each sentence, and score each full sequence as the sum of its sentence rewards. Take the best and worst responses as chosen/rejected pairs and train with DPO, then generate samples again with the trained VLM. Repeat this three times.
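A toy sketch of this loop's pair construction. Everything here (`sentence_reward`, the stub scoring, the candidate lists) is illustrative, not the paper's code; the only structure taken from the note is that a response's reward is the sum of its sentence rewards and that the best/worst candidates become the DPO chosen/rejected pair.

```python
# Illustrative sketch of self-reward pair construction (stub scores,
# hypothetical names). Real text/image scores come from the LVLM and CLIP.

def sentence_reward(sentence: str, lam: float = 0.9) -> float:
    """Weighted sum of a text score and an image score (both stubbed)."""
    words = sentence.split()
    text_score = len(set(words)) / max(len(words), 1)  # stub: penalize repetition
    image_score = 0.5                                  # stub for CLIPScore
    return lam * text_score + (1 - lam) * image_score

def sequence_reward(response: list[str]) -> float:
    """Score a response as the sum of its per-sentence rewards."""
    return sum(sentence_reward(s) for s in response)

def build_dpo_pair(candidates: list[list[str]]):
    """Best- and worst-scored beam candidates become (chosen, rejected)."""
    ranked = sorted(candidates, key=sequence_reward, reverse=True)
    return ranked[0], ranked[-1]

# One iteration: sample beams per prompt, build pairs, then DPO-train (omitted).
beams = [["a cat sits on a mat"], ["a cat a cat a cat"]]
chosen, rejected = build_dpo_pair(beams)
```

After DPO training on these pairs, the updated model regenerates candidates and the process repeats, three times in total per the note.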
Reward
Weighted sum of a text score and an image score: $R = \lambda \cdot R_{\text{text}} + (1-\lambda) \cdot R_{\text{image}}$, where $\lambda$ is a hyperparameter, set to 0.9.
- text score
$x$ : prompt, $r_i$ : $i$-th response token, $s$ : sentence, $R_t$ : the text-decoder part of the LVLM.
Interestingly, it conditions only on the current sentence: not the image, not the previous sentences. The paper calls this the instruction-following score.
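One plausible reading of the text score, given the notation above ($R_t$ scoring the response tokens $r_i$ of a sentence $s$ given prompt $x$): the length-normalized probability of the sentence under the text decoder. This is a guess at the normalization, not the paper's exact formula.

```python
import math

def text_score(token_logprobs: list[float]) -> float:
    """Sketch: geometric-mean token probability of a sentence, i.e.
    exp of the mean log-prob the text decoder R_t assigns to the
    sentence's tokens given the prompt. The paper's exact
    normalization may differ."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))
```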
- image score
CLIPScore.
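For reference, CLIPScore (Hessel et al., 2021) is a rescaled, clipped cosine similarity between CLIP image and text embeddings. The sketch below takes precomputed embedding vectors as input; in practice they would come from a CLIP encoder.

```python
import numpy as np

def clip_score(image_emb: np.ndarray, text_emb: np.ndarray,
               w: float = 2.5) -> float:
    """CLIPScore: w * max(cos(image_emb, text_emb), 0).
    Embeddings are stand-in vectors here, not real CLIP features."""
    cos = float(image_emb @ text_emb /
                (np.linalg.norm(image_emb) * np.linalg.norm(text_emb)))
    return w * max(cos, 0.0)
```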
Result
Comparison with other VLMs
Results across iterations