TL;DR
- I read this because : it is the successor to RLHF-V
- task : vision-RLHF
- problem : human annotated preference data is not scalable
- Idea : have responses evaluated by peer LVLMs. Divide the response into logical units (atomic claims), turn each claim into a binary question, and score it as correct or incorrect.
- input/output : {image, question} -> answer
- architecture : LLaVA 1.5, OmniLMM
- objective : DPO loss
- baseline : VCD, Less-is-more, LURE, Qwen-VL, LLaVA-NeXT, Mini-Gemini, HA-DPO, POVID, LLaVA-RLHF, Silkie, RLHF-V
- data : image - instruction from {MSCOCO, ShareGPT-4V, MovieNet, GoogleLandmark v2, VQA v2, OKVQA, TextVQA} => DPO data
- evaluation : trustworthiness (Object HalBench, MMHal-Bench, MHumanEval, AMBER), helpfulness (LLaVA Bench, MMStar)
- result : Performance beyond GPT-4V in trustworthiness.
- contribution : Quickly glue RLAIF to VLM!
- etc. : iterative alignment, etc. … seems like a lot of work
Details
RLAIF-V
response generation — sample n candidate answers from the target (policy) model using different random seeds
response evaluation
divide — because answers are long and contain multiple statements, break each answer into atomic claims.
conquer — turn each claim into a yes/no question to measure its trustworthiness (phrased so that the expected answer for a correct claim is "yes"). Each question is then put to the labeler model and the answer is scored.
combine — the final score is $S = -n_{rej}$, where $n_{rej}$ is the number of claims the labeler marks as incorrect. Answer pairs with differing scores are then formed, sampling up to two pairs per instruction. (The additional filtering steps at this point didn't make sense to me.)
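The divide/conquer/combine steps above can be sketched as follows. This is a minimal illustration, not the paper's code: `split_into_claims` and the `labeler_says_yes` callback are hypothetical stand-ins for the LVLM calls.

```python
from itertools import combinations

def split_into_claims(answer):
    # Stub: in the actual pipeline an LLM splits the answer into atomic claims;
    # here we just split on sentence boundaries.
    return [s.strip() for s in answer.split(".") if s.strip()]

def score_answer(answer, labeler_says_yes):
    """combine: final score S = -n_rej, where n_rej counts rejected claims."""
    n_rej = sum(1 for claim in split_into_claims(answer)
                if not labeler_says_yes(claim))
    return -n_rej

def sample_pairs(answers, labeler_says_yes, max_pairs=2):
    """Form (chosen, rejected) pairs from answers with differing scores,
    keeping at most `max_pairs` per instruction."""
    scored = [(score_answer(a, labeler_says_yes), a) for a in answers]
    pairs = []
    for (s1, a1), (s2, a2) in combinations(scored, 2):
        if s1 > s2:
            pairs.append((a1, a2))  # higher score is the chosen answer
        elif s2 > s1:
            pairs.append((a2, a1))
    return pairs[:max_pairs]
```

In the paper the yes/no judgment comes from an open-source labeler LVLM; here any boolean predicate over a claim works as a drop-in.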
- iterative alignment
If you simply apply DPO, there is a distribution shift problem: the model's output distribution changes during training, so the preference data goes stale. (The citation for this is Scaling Laws for Reward Model Overoptimization.)
To address this, they propose iterative alignment, which loops through DPO data collection -> training -> DPO data collection -> training.
Generate responses with the most recent instruction model $M_i$, build pairs with the divide-and-conquer strategy above, train on them, and repeat.
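The iteration loop reduces to something like the sketch below. The helper functions are hypothetical stubs standing in for the real sampling, scoring, and DPO training steps.

```python
def generate(model, question, n_samples):
    # Stub: real code samples n answers from the current policy with varied seeds.
    return [f"{model}:{question}:{k}" for k in range(n_samples)]

def build_dpo_pairs(responses):
    # Stub: real pairs come from the divide-and-conquer scoring above.
    return [(cands[0], cands[-1]) for cands in responses.values()]

def train_dpo(model, pairs):
    # Stub: one round of DPO training produces the next policy M_{i+1}.
    return f"{model}+dpo"

def iterative_alignment(model, instructions, n_iters=4):
    """Repeat: collect DPO data with the *current* policy M_i, then train.

    Sampling from the current model each round keeps the preference data
    on-distribution, which is the point of iterative alignment."""
    for _ in range(n_iters):
        responses = {q: generate(model, q, n_samples=4) for q in instructions}
        pairs = build_dpo_pairs(responses)
        model = train_dpo(model, pairs)
    return model
```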
Experiment
- hparams
- base models
- LLaVA 1.5 as the instruction model – the corresponding labeler model is LLaVA-NeXT (Nous-Hermes-2-Yi-34B)
- OmniLMM – the corresponding labeler model is OmniLMM itself (no RLHF)
- 4 epochs, lr 5e-7, beta 0.1, bs 8
- 4 iterations (4K instructions)
- the 7B / 12B models are trained on 8× A100 GPUs, which seems tough
- data collection takes 48h / 50h
- training takes 6h / 8h
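For reference, the DPO objective with the beta above can be written as a minimal per-response sketch (assumed simplification: one scalar log-probability per response, whereas real training sums token-level log-likelihoods):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss: -log sigmoid(beta * implicit reward margin).

    The margin compares how much the policy upweights the chosen answer
    relative to the reference model, versus the rejected answer.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) == log(1 + exp(-m)); log1p is numerically stable.
    return math.log1p(math.exp(-margin))
```

With equal log-probabilities the loss is log 2; as the policy separates chosen from rejected answers, the loss drops toward 0.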
Result
analysis
deconfounded strategy
Ablation of scoring with divide-and-conquer strategies
- RLHF-V trained with fine-grained human feedback
- adapted replaces the preferred response with the human annotation
Ours came out the best. The RLHF-V data is a fine-grained human correction of Muffin inference results, while adapted is rewritten by humans, so the performance gap is quite large. This made me wonder: is it really that important to use the model's own inference results when doing DPO?
- Self rewarding vs divide-and-conquer
Self-rewarding simply gives the labeler a long prompt and asks it to score the whole response directly.
The proposed divide-and-conquer scoring clearly worked better.
- iterative alignment
The performance of the non-iterative alignment method seemed to saturate quickly.
- data source / multiple LVLMs
Performance was consistently good across the different data sources.
It worked well with a variety of LVLMs as labelers, with OmniLMM performing the best.