TL;DR
- I read this because : it is the successor to RLHF-V
- task : vision-RLHF
- problem : human annotated preference data is not scalable
- Idea : have responses evaluated by peer LVLMs. Divide the response into logical units (atomic claims), turn each claim into a binary question, and score it as correct or incorrect.
- input/output : {image, question} -> answer
- architecture : LLaVA 1.5, OmniLMM
- objective : DPO loss
- baseline : VCD, Less-is-more, LURE, Qwen-VL, LLaVA-NeXT, Mini-Gemini, HA-DPO, POVID, LLaVA-RLHF, Silkie, RLHF-V
- data : image - instruction from {MSCOCO, ShareGPT-4V, MovieNet, GoogleLandmark v2, VQA v2, OKVQA, TextVQA} => DPO data
- evaluation : trustworthiness (Object HalBench, MMHal-Bench, MHumanEval, AMBER), helpfulness (LLaVA Bench, MMStar)
- result : Performance beyond GPT-4V in trustworthiness.
- contribution : Quickly glue RLAIF to VLM!
- etc. : iterative alignment, etc. … seems like a lot of work
Details
RLAIF-V
response generation — sample n candidate answers from the target (policy) model using different random seeds
response evaluation
divide — because answers are long and contain multiple statements, break each answer into atomic claims.
conquer — turn each claim into a yes/no question to measure its trustworthiness (phrased so that the expected answer for a correct claim is "yes"). Each question is then put to the labeler model and the answer is scored.
combine — the final score is $S = -n_{rej}$, where $n_{rej}$ is the number of claims the labeler marks as incorrect. Answer pairs with differing scores are then formed, sampling up to two pairs per instruction. (The additional filtering steps at this point didn't make sense to me.)
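The divide/conquer/combine steps above can be sketched as follows. This is a minimal illustration, not the paper's code: `split_into_claims` and the `labeler_says_yes` callback are hypothetical stand-ins for the LVLM calls.

```python
from itertools import combinations

def split_into_claims(answer):
    # Stub: in the actual pipeline an LLM splits the answer into atomic claims;
    # here we just split on sentence boundaries.
    return [s.strip() for s in answer.split(".") if s.strip()]

def score_answer(answer, labeler_says_yes):
    """combine: final score S = -n_rej, where n_rej counts rejected claims."""
    n_rej = sum(1 for claim in split_into_claims(answer)
                if not labeler_says_yes(claim))
    return -n_rej

def sample_pairs(answers, labeler_says_yes, max_pairs=2):
    """Form (chosen, rejected) pairs from answers with differing scores,
    keeping at most `max_pairs` per instruction."""
    scored = [(score_answer(a, labeler_says_yes), a) for a in answers]
    pairs = []
    for (s1, a1), (s2, a2) in combinations(scored, 2):
        if s1 > s2:
            pairs.append((a1, a2))  # higher score is the chosen answer
        elif s2 > s1:
            pairs.append((a2, a1))
    return pairs[:max_pairs]
```

In the paper the yes/no judgment comes from an open-source labeler LVLM; here any boolean predicate over a claim works as a drop-in.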
- iterative alignment
If you simply apply DPO, there is a distribution shift problem: the model's output distribution changes during training, so the preference data goes stale. (The citation for this is Scaling Laws for Reward Model Overoptimization.)
To address this, they propose iterative alignment, which loops through DPO data collection -> training -> DPO data collection -> training.
Generate responses with the most recent instruction model $M_i$, build pairs with the divide-and-conquer strategy above, train on them, and repeat.
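The iteration loop reduces to something like the sketch below. The helper functions are hypothetical stubs standing in for the real sampling, scoring, and DPO training steps.

```python
def generate(model, question, n_samples):
    # Stub: real code samples n answers from the current policy with varied seeds.
    return [f"{model}:{question}:{k}" for k in range(n_samples)]

def build_dpo_pairs(responses):
    # Stub: real pairs come from the divide-and-conquer scoring above.
    return [(cands[0], cands[-1]) for cands in responses.values()]

def train_dpo(model, pairs):
    # Stub: one round of DPO training produces the next policy M_{i+1}.
    return f"{model}+dpo"

def iterative_alignment(model, instructions, n_iters=4):
    """Repeat: collect DPO data with the *current* policy M_i, then train.

    Sampling from the current model each round keeps the preference data
    on-distribution, which is the point of iterative alignment."""
    for _ in range(n_iters):
        responses = {q: generate(model, q, n_samples=4) for q in instructions}
        pairs = build_dpo_pairs(responses)
        model = train_dpo(model, pairs)
    return model
```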
Experiment
- hparams
- base models
- LLaVA 1.5 as the instruction model – the corresponding labeler model is LLaVA-NeXT (Nous-Hermes-2-Yi-34B)
- OmniLMM – the corresponding labeler model is OmniLMM itself (no RLHF)
- 4 epochs, lr 5e-7, beta 0.1, bs 8
- 4 iterations (4K instructions)
- the 7B / 12B models are trained on 8× A100 GPUs, which seems tough
- data collection takes 48h / 50h
- training takes 6h / 8h
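For reference, the DPO objective with the beta above can be written as a minimal per-response sketch (assumed simplification: one scalar log-probability per response, whereas real training sums token-level log-likelihoods):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss: -log sigmoid(beta * implicit reward margin).

    The margin compares how much the policy upweights the chosen answer
    relative to the reference model, versus the rejected answer.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) == log(1 + exp(-m)); log1p is numerically stable.
    return math.log1p(math.exp(-margin))
```

With equal log-probabilities the loss is log 2; as the policy separates chosen from rejected answers, the loss drops toward 0.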
Result
analysis
deconfounded strategy
Ablation of scoring with divide-and-conquer strategies
- RLHF-V trained with fine-grained human feedback
- adapted replaces the preferred response with the human annotation
Ours came out the best. The RLHF-V data is a fine-grained human correction of Muffin inference results, while adapted is rewritten by humans, so the performance gap is quite large. This made me wonder: is it really that important to use the model's own inference results when doing DPO?
- Self rewarding vs divide-and-conquer
Self-rewarding simply gives the labeler a long prompt and asks it to score the whole response directly.
The proposed divide-and-conquer scoring clearly worked better.
- iterative alignment
The performance of the non-iterative alignment method seemed to saturate quickly.
- data source / multiple LVLMs
Performance was consistently good across the different data sources.
It worked well with a variety of LVLMs as labelers, with OmniLMM performing the best.