TL;DR
- I read this because: self-rewarding for VLMs
- task : LVLM
- Problem : LVLMs suffer from object hallucination because they pay too much attention to the text tokens.
- idea : combine self-rewarding with an image-relevance term (CLIPScore) so the reward depends on the image
- architecture : LLaVA 1.5 7B / 13B
- objective : DPO loss
- baseline : LLaVA, RLHF-V, VLfeedback, …
- data : self-generated, regenerated at each iteration; the seed set is a randomly drawn 13K subset of the LLaVA-Instruct-150K data
- evaluation : VLM bench(MME, SEED, LLaVA_w, MMBench, …), VQA(SQA, VisWiz, GQA), Hall-bench(POPE, CHAIR)
- Result : VLM bench, VQA, hall-bench all improved
- contribution :
- etc. :
Details
Preliminary
LARGE LANGUAGE MODELS CAN SELF-IMPROVE https://arxiv.org/abs/2210.11610
Proposed
Generate samples with the VLM (beam search decoding), assign a reward to each sentence, and score each full sequence as the sum of its sentence rewards. Take the best and worst responses as chosen/rejected pairs and train with DPO, then generate samples again with the trained VLM. Repeat this three times.
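A toy sketch of this loop's pair construction. Everything here (`sentence_reward`, the stub scoring, the candidate lists) is illustrative, not the paper's code; the only structure taken from the note is that a response's reward is the sum of its sentence rewards and that the best/worst candidates become the DPO chosen/rejected pair.

```python
# Illustrative sketch of self-reward pair construction (stub scores,
# hypothetical names). Real text/image scores come from the LVLM and CLIP.

def sentence_reward(sentence: str, lam: float = 0.9) -> float:
    """Weighted sum of a text score and an image score (both stubbed)."""
    words = sentence.split()
    text_score = len(set(words)) / max(len(words), 1)  # stub: penalize repetition
    image_score = 0.5                                  # stub for CLIPScore
    return lam * text_score + (1 - lam) * image_score

def sequence_reward(response: list[str]) -> float:
    """Score a response as the sum of its per-sentence rewards."""
    return sum(sentence_reward(s) for s in response)

def build_dpo_pair(candidates: list[list[str]]):
    """Best- and worst-scored beam candidates become (chosen, rejected)."""
    ranked = sorted(candidates, key=sequence_reward, reverse=True)
    return ranked[0], ranked[-1]

# One iteration: sample beams per prompt, build pairs, then DPO-train (omitted).
beams = [["a cat sits on a mat"], ["a cat a cat a cat"]]
chosen, rejected = build_dpo_pair(beams)
```

After DPO training on these pairs, the updated model regenerates candidates and the process repeats, three times in total per the note.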
Reward
Weighted sum of a text score and an image score: $R = \lambda \cdot R_{\text{text}} + (1-\lambda) \cdot R_{\text{image}}$, where $\lambda$ is a hyperparameter, set to 0.9.
- text score
$x$ : prompt, $r_i$ : $i$-th response token, $s$ : sentence, $R_t$ : the text-decoder part of the LVLM.
Interestingly, it conditions only on the current sentence: not the image, not the previous sentences. The paper calls this the instruction-following score.
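One plausible reading of the text score, given the notation above ($R_t$ scoring the response tokens $r_i$ of a sentence $s$ given prompt $x$): the length-normalized probability of the sentence under the text decoder. This is a guess at the normalization, not the paper's exact formula.

```python
import math

def text_score(token_logprobs: list[float]) -> float:
    """Sketch: geometric-mean token probability of a sentence, i.e.
    exp of the mean log-prob the text decoder R_t assigns to the
    sentence's tokens given the prompt. The paper's exact
    normalization may differ."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))
```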
- image score
CLIPScore.
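For reference, CLIPScore (Hessel et al., 2021) is a rescaled, clipped cosine similarity between CLIP image and text embeddings. The sketch below takes precomputed embedding vectors as input; in practice they would come from a CLIP encoder.

```python
import numpy as np

def clip_score(image_emb: np.ndarray, text_emb: np.ndarray,
               w: float = 2.5) -> float:
    """CLIPScore: w * max(cos(image_emb, text_emb), 0).
    Embeddings are stand-in vectors here, not real CLIP features."""
    cos = float(image_emb @ text_emb /
                (np.linalg.norm(image_emb) * np.linalg.norm(text_emb)))
    return w * max(cos, 0.0)
```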
Result
Comparison with other VLMs
Results across iterations