
paper, data, code

TL;DR

  • why I read this : VLM + RLHF
  • task : MLLM
  • problem : hallucination in MLLMs. Even GPT-4V hallucinates in 45.9% of cases.
  • idea : train with DPO, but annotate exactly which segments of the response are wrong.
  • input/output : {image, question} -> answer
  • architecture : Muffin, the authors' earlier work; a model based on BEiT-3 + 13B Vicuna v1.0
  • objective : slightly modified DPO; the log-prob terms inside the DPO loss are weighted differently.
  • baseline : QwenVL-Chat, LLaVA, LLaVA-1.5, Muffin, InstructBLIP, LLaVA-RLHF
  • data : 1.4K human-annotated preference pairs
  • evaluation : Object HalBench, MMHal-Bench, MHumanEval, LLaVA Bench, VQAv2
  • result : SOTA among open models on hallucination (even beats GPT-4V on some metrics). On LLaVA Bench, LLaVA-RLHF is slightly better, but the two are roughly on par.
  • contribution : efficient DPO training; data release
  • etc. :

Details

overall

(figure: overall framework)

underlying challenges in human preference data

  1. ambiguity : given two candidate answers, each has its own strengths and weaknesses, so it is ambiguous which one annotators should prefer
  2. learning efficiency : feedback is a single label over an entire long response, which makes learning hard and data-hungry; this credit misallocation also causes problems such as reward hacking

fine-grained correctional human preference collection

Humans annotate at the segment level, correcting the hallucinated segments. The response before correction becomes $y_l$ and the corrected response becomes $y_w$. The data comes from instruction-data sources: image-description prompts are generated with GPT-4 (?) and the answers are sampled from Muffin (??).
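A minimal sketch of how a segment-level correction record could be turned into a $(y_w, y_l)$ preference pair. The record format and function name here are assumptions for illustration, not the paper's released schema:

```python
# Hypothetical helper: build a DPO preference pair from segment-level corrections.
# The (hallucinated_span, corrected_span) tuple format is an assumption.
def build_preference_pair(response, corrections):
    y_l = response  # original (hallucinated) answer = rejected response
    y_w = response
    for bad, good in corrections:
        y_w = y_w.replace(bad, good)  # apply each human correction in place
    return y_w, y_l

# Toy example with two corrected segments.
y_w, y_l = build_preference_pair(
    "A red bicycle leans against two trees.",
    [("red", "blue"), ("two trees", "a wall")],
)
print(y_w)  # -> A blue bicycle leans against a wall.
```

Because $y_w$ and $y_l$ differ only in the corrected spans, the pair localizes exactly which segments were hallucinated, which is what DDPO exploits below.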

Statistics of the resulting data: on average 64.4 words per response with 2.65 corrected segments. Hallucination types: objects (41.2%), positions (20.3%), numbers (16.5%), attributes (10.0%), actions (5.3%), and misc.

Dense Direct Preference Optimization

  • DPO loss recap:

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

($\beta = 0.5$)

The proposed DDPO reweights the log-prob term: tokens belonging to a corrected segment ($y_c$) get a higher weight than unchanged tokens ($y_u$).

$$\log p(y \mid x) = \frac{1}{N}\left[\sum_{y_i \in y_u} \log p(y_i \mid x, y_{<i}) + \gamma \sum_{y_i \in y_c} \log p(y_i \mid x, y_{<i})\right]$$
  • $\gamma$ : 5
  • $N$: len($y_u$) + $\gamma$ len($y_c$)
    • 1/N์€ ๊ธธ์–ด์ง€๋Š” longer response์— ๋Œ€ํ•œ ์„ ํ˜ธ๋ฅผ ํ†ต์ œํ•˜๊ธฐ ์œ„ํ•ด ์žˆ์Œ

Result

(figures: main benchmark results)

Ablations

(figures: ablation results)