paper

TL;DR

  • why I read this : VLM + RLHF
  • task : LVLM
  • problem : hallucination
  • idea : collect human annotations at the segment level to measure hallucination, then train with rejection sampling / a DPO-style objective
  • input/output : {image, question, response} -> per-segment class (accurate, inaccurate, analysis)
  • architecture : InstructBLIP
  • objective : CE loss or proposed FDPO loss
  • baseline : InstructBLIP, LLaVA, mPLUG-OWL
  • data : (proposed) 16K image-prompt-response
  • evaluation : RM score (NLL on the true segments), human eval (percent of content that was truthful? unclear whether it's sentence-level..)
  • result : training a reward model and applying rejection sampling improves performance; the proposed FDPO improves performance as well.
  • contribution : released a benchmark; seems to be a fairly early work applying RLHF to VLMs
  • etc. : maybe because the M-HalDetect benchmark turned out well, it is cited a lot, but somehow the paper doesn't read that easily..

Details

์•„๋ž˜์™€ ๊ฐ™์ด annotation image

4,000 images - InstructBLIP responses (10 human annotated); classes are accurate, inaccurate, analysis, unsure (4 in total)

3,200 of these are the val split -> this is presumably M-HalDetect

Method

  • Multi-Modal Reward Model: uses InstructBLIP. A classifier (accurate, inaccurate, analysis) is attached to the EOS token of each sentence and trained that way. For the segment-level reward model, the classifier is attached at the end of each segment (digging into the data, a segment just runs on until a different label appears). Not sure why they did it this way..! See the sketch right below.

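A minimal sketch of how I picture that classifier head, assuming a decoder that exposes per-token hidden states; `SegmentRewardHead`, the head layout, and the way segment-end positions are passed in are my assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

NUM_CLASSES = 3  # accurate, inaccurate, analysis

class SegmentRewardHead(nn.Module):
    """Linear classifier read off at each sentence-EOS / segment-end position."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, NUM_CLASSES)

    def forward(self, hidden_states: torch.Tensor, segment_ends: torch.Tensor):
        # hidden_states: (batch, seq_len, hidden) from the InstructBLIP decoder
        # segment_ends:  (batch, n_segments) index of each segment's last token
        #                (sentence EOS, or for the segment-level variant the last
        #                 token before the human label changes)
        batch_idx = torch.arange(hidden_states.size(0),
                                 device=hidden_states.device).unsqueeze(-1)
        seg_states = hidden_states[batch_idx, segment_ends]  # (batch, n_segments, hidden)
        return self.classifier(seg_states)                   # per-segment class logits
```

Training would then be plain cross-entropy between these logits and the human labels, matching the CE loss in the TL;DR.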
  • Rejection sampling: not properly explained, but it seems that at inference time they draw multiple samples, then use the reward model's per-sentence negative log-likelihood to judge whether each response contains hallucinations, and select best-of-n / worst-of-n accordingly, with n = 16, 64. A guess at the loop is sketched below.

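My guess at the best-of-n loop under the reading above; `generate`, `split_sentences`, and `rm_class_logprobs` are hypothetical helpers, and the exact scoring rule is an assumption:

```python
import math

def best_of_n(policy, reward_model, image, prompt, n=16):
    """Sample n responses and keep the one the RM finds least hallucinated."""
    best_resp, best_score = None, -math.inf
    for _ in range(n):
        resp = generate(policy, image, prompt, do_sample=True)  # hypothetical helper
        # Per sentence, take the RM's NLL of the "inaccurate" class: a high NLL
        # means the RM considers hallucination unlikely there, so summing over
        # sentences gives a response-level truthfulness score.
        score = sum(
            -rm_class_logprobs(reward_model, image, prompt, sent)["inaccurate"]
            for sent in split_sentences(resp)  # hypothetical helper
        )
        if score > best_score:
            best_resp, best_score = resp, score
    return best_resp
```

worst-of-n is the same loop keeping the minimum instead.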
  • fine-grained direct preference optimization: unlike DPO there are no preference pairs here, so the loss is applied directly per segment (symbols below; a reconstruction of the objective follows the list)

[figure: FDPO objective]
  • $x$ : the tokens up to the current segment
  • $y$ : the generated segment
  • $c$ : class of the current segment
    • 1 : preferred class (correct)
    • 0 : dispreferred class (incorrect; optionally analysis as well)
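
My reconstruction of the objective in the figure, from the symbol definitions above and the usual DPO form ($\sigma$ the sigmoid, $\beta$ the DPO temperature, $\pi_{\mathrm{ref}}$ the frozen reference model); the exact sign convention and weighting may differ from the paper:

$$\mathcal{L}_{\mathrm{FDPO}} = -\,\mathbb{E}_{(x,y,c)\sim\mathcal{D}}\left[\log\sigma\!\left((2c-1)\,\beta\,\log\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}\right)\right]$$

With $c=1$ the segment's likelihood is pushed up relative to the reference, and with $c=0$ pushed down, so no preference pair is needed. The same thing as a loss sketch (the $(2c-1)$ factor and $\beta$ are my assumptions):

```python
import torch.nn.functional as F

def fdpo_loss(logp_theta, logp_ref, c, beta=0.1):
    # logp_theta, logp_ref: (n_segments,) summed token log-probs of each segment y
    #                       under the policy and the frozen reference model
    # c: (n_segments,) float, 1.0 for preferred segments, 0.0 for dispreferred
    sign = 2.0 * c - 1.0
    return -F.logsigmoid(sign * beta * (logp_theta - logp_ref)).mean()
```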

Result

  • reward model์˜ ์„ฑ๋Šฅ image

  • rejection sampling / fine-grained DPO results [figure]

The RM score doesn't quite click for me.. Human eval shows improvement. They don't report other hallucination benchmarks or general VLM benchmarks.