image

paper, code

TL;DR

  • I read this because : it's the successor to RLHF-V
  • task : vision-RLHF
  • problem : human annotated preference data is not scalable
  • Idea : get evaluated by peer LVLMs. Divide the reward into logical units, turn each unit into a binary question, and score it as correct or incorrect.
  • input/output : {image, question} -> answer
  • architecture : LLaVA 1.5, OmniLMM
  • objective : DPO loss
  • baseline : VCD, Less-is-more, LURE, Qwen-VL, LLaVA-NeXT, Mini-Gemini, HA-DPO, POVID, LLaVA-RLHF, Silkie, RLHF-V
  • data : image - instruction from {MSCOCO, ShareGPT-4V, MovieNet, GoogleLandmark v2, VQA v2, OKVQA, TextVQA} => DPO data
  • evaluation : trustworthiness (Object HalBench, MMHal-Bench, MHumanEval, AMBER), helpfulness (LLaVA Bench, MMStar)
  • result : surpasses GPT-4V in trustworthiness.
  • contribution : a quick way to glue RLAIF onto VLMs!
  • etc. : iterative alignment and so on; seems like a lot of engineering work

Details

performance

image

RLAIF-V

image

  1. response generation Sample n answers from the target model using different seeds.

  2. response evaluation

  • divide Because answers are long and contain multiple statements, break each answer into atomic claims.

  • conquer Turn each claim into a binary yes/no question to measure its trustworthiness (phrased so that "yes" corresponds to the claim being correct). The labeler model is then asked each question and its answer is scored.

  • combine Compute the final score as $S = -n_{rej}$, where $n_{rej}$ is the number of claims judged incorrect. Then, among answers whose scores differ, sample up to two preference pairs per instruction. The additional filtering step at this point didn't quite make sense to me.

  3. iterative alignment Simply applying DPO runs into a "distribution shift problem": the model's output distribution changes during training (the citation for this is Scaling Laws for Reward Model Overoptimization). To address this, they propose iterative alignment that loops Training -> DPO data collection -> Training

image

Generate with the most recent instruction model $M_i$, build pairs with the divide-and-conquer strategy above, train on them, and repeat.
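The divide-conquer-combine scoring and pair construction can be sketched as below. This is a minimal sketch, not the paper's implementation: `score_response`, `build_preference_pairs`, and the `labeler` callable are hypothetical stand-ins for the LVLM calls, and claim splitting / yes-no conversion are assumed to happen upstream.

```python
from itertools import combinations
import random

def score_response(claims, labeler):
    """Combine step: final score S = -n_rej, where n_rej is the number
    of atomic claims the labeler judges incorrect."""
    n_rej = sum(1 for c in claims if not labeler(c))
    return -n_rej

def build_preference_pairs(scored_responses, max_pairs=2, seed=0):
    """From (response, score) tuples, keep (chosen, rejected) pairs whose
    scores differ, then sample up to `max_pairs` pairs per instruction."""
    pairs = [(a, b) if sa > sb else (b, a)
             for (a, sa), (b, sb) in combinations(scored_responses, 2)
             if sa != sb]
    random.Random(seed).shuffle(pairs)
    return pairs[:max_pairs]
```

With a toy labeler that rejects claims containing "wrong", a response whose claims all pass scores 0, one with a single failed claim scores -1, and the two form one (chosen, rejected) DPO pair.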

Experiment

  • hparams
    • base models
      • LLaVA 1.5 as the instruction model; the corresponding labeler model is LLaVA-NeXT (Nous-Hermes-2-Yi-34B)
      • OmniLMM; the labeler is OmniLMM itself (the pre-RLHF version)
    • 4 epochs, lr 5e-7, beta 0.1, bs 8
    • 4 iterations (4K instructions)
    • training the 7B / 12B models on 8×A100s is heavy:
      • data collection: 48h / 50h
      • training: 6h / 8h
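For reference, these hyperparameters plug into the standard DPO objective (the $\beta = 0.1$ above), with $y_w$ / $y_l$ the chosen / rejected answers from the pair construction:

$$\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$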

Result

image

analysis

deconfounded strategy

Ablation of the divide-and-conquer scoring strategy

image

  • RLHF-V: trained with fine-grained human feedback
  • adapted: the preferred response is replaced with a human annotation

Ours came out best. The RLHF-V data consists of fine-grained human corrections of Muffin inference results, while adapted is rewritten by humans, so the performance gap is quite large. It made me wonder: is it really that important to use the model's own inference results when doing DPO?

  • Self-rewarding vs divide-and-conquer Self-rewarding simply gives the labeler a long prompt and asks it to score the whole response.

image

The proposed divide-and-conquer scoring clearly worked better.

  • iterative alignment

image

The performance of the non-iterative alignment method seemed to saturate quickly.

  • data source / multiple LVLMs Performance was consistently better than when trained on other data.

image

It worked well with a variety of LVLMs, with OmniLMM performing best.