[172] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

paper , data , code

TL;DR

I read this because.. : VLM + RLHF
task : MLLM
problem : Hallucination in MLLMs. Even GPT-4V shows hallucination in 45.9% of cases.
idea : Train with DPO, but have annotators mark exactly which segments are wrong and correct them.
input/output : {image, question} -> answer
architecture : Muffin, the authors' previous work. A model based on BEiT-3 + 13B Vicuna 1.0.
objective : A slightly modified DPO. The weighting of the log-prob terms that go into the DPO loss is changed slightly.
baseline : QwenVL-Chat, LLaVA, LLaVA1.5, Muffin, InstructBLIP, LLaVA-RLHF
data : 1.4K human-annotated samples
evaluation : Object HalBench, MMHal-Bench, MHumanEval, LLaVA Bench, VQAv2
result : SOTA among open models in terms of hallucination (even beats GPT-4V on some metrics). On LLaVA Bench, LLaVA-RLHF is slightly better, but the two are roughly on par.
contribution : Efficient DPO training. Public release of the data.
etc. :
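
The "objective" line above can be sketched in a few lines of Python. This is only a rough illustration, not the authors' implementation: these notes do not record the exact weighting, so the sketch assumes that tokens inside human-corrected segments are up-weighted by a factor `gamma` and that the sum is normalised by the total weight. The function names `weighted_seq_logp` and `dpo_loss`, and the default values of `gamma` and `beta`, are placeholders chosen for the example.

```python
import math


def weighted_seq_logp(token_logps, corrected_mask, gamma=5.0):
    """Weighted, normalised sequence log-prob.

    Assumption (not taken from the notes): tokens inside corrected
    segments get weight `gamma`, all other tokens get weight 1.
    """
    weights = [gamma if is_corrected else 1.0 for is_corrected in corrected_mask]
    total = sum(w * lp for w, lp in zip(weights, token_logps))
    return total / sum(weights)


def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.5):
    """Standard DPO loss on already-aggregated sequence log-probs.

    pol_* are policy log-probs, ref_* are frozen reference log-probs,
    for the preferred (w) and rejected (l) responses.
    """
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))
```

In use, `weighted_seq_logp` would be applied to the per-token log-probs of $y_w$ and $y_l$ under both the policy and the reference model, and the four resulting scalars passed to `dpo_loss`. With `gamma=1.0` the aggregation reduces to a plain length-normalised log-prob.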

Details

overall

underlying challenges in human preference data

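ambiguity: when comparing two responses, each has its own strengths and weaknesses, so it is unclear which one should be preferred.
learning efficiency: a single response-level feedback signal has to cover a long answer, which makes learning hard and data-hungry; this credit misallocation leads to problems such as reward hacking.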

fine-grained correctional human preference collection

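Human annotation is done at the segment level: annotators correct the hallucinated segments. The corrected response becomes $y_w$ and the original (pre-correction) response becomes $y_l$. For the data, image description prompts are generated with GPT-4 from instruction data sources (?), and the answers are obtained from Muffin (??).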
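
To make the pair construction concrete, here is a minimal Python sketch of turning one correction into a $(y_w, y_l)$ pair plus per-token masks of the changed segments. It is an illustration under assumptions, not the paper's pipeline: the changed segments are recovered by diffing the two texts with `difflib` rather than read from the annotation tool, tokens are whitespace-split words rather than model tokens, and `build_pair` with its output keys is a hypothetical helper.

```python
from difflib import SequenceMatcher


def build_pair(original, corrected):
    """Build a preference pair from a model response and its human correction.

    Returns word lists for y_w (corrected) and y_l (original) and boolean
    masks marking the words that belong to changed segments.
    """
    y_l = original.split()   # rejected: the hallucinated model output
    y_w = corrected.split()  # preferred: the human-corrected output
    mask_l = [False] * len(y_l)
    mask_w = [False] * len(y_w)

    matcher = SequenceMatcher(None, y_l, y_w, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged span, shared by both responses
        for i in range(i1, i2):
            mask_l[i] = True  # words removed or replaced in the original
        for j in range(j1, j2):
            mask_w[j] = True  # words inserted or replaced by the annotator

    return {"y_w": y_w, "y_l": y_l, "mask_w": mask_w, "mask_l": mask_l}
```

For example, `build_pair("A dog sits on the red sofa", "A cat sits on the blue sofa")` marks only `cat` and `blue` in `mask_w` (and `dog` and `red` in `mask_l`), so the feedback is tied to the two corrected segments instead of the whole response.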

Statistics of the resulting data: 64.4 words and 2.65 corrected segments per response on average. Hallucination types: objects (41.2%), positions (20.3%), numbers (16.5%), attributes (10.0%), actions (5.3%), and misc.