image

paper

TL;DR

  • I read this because.. : Google Scholar recommended it
  • task : VLM + RLHF
  • problem : we want to fix VLM hallucination; can DPO training data be built cheaply?
  • idea : build it with CLIP score?
  • input/output : {image, question} -> score
  • architecture : MobileVLM-v2, LLaVA 1.5
  • objective : DPO loss
  • baseline : BLIP-2, InstructBLIP, Shikra, OpenFlamingo, Qwen-VL … ShareGPT4V; among DPO methods, HA-DPO
  • data : the image source is the SFT data; responses are generated with MobileVLM-v2 and filtered with CLIP score and heuristics. Responses whose CLIP scores differ by 2 or more are turned into win / lose pairs
  • evaluation : AMBER; the classification tasks used to evaluate CLIP (generate a caption, then run zero-shot classification with SigLIP); VLM benchmarks (GQA, SQA, VQA, MME, MMB)
  • result : AMBER improves. SOTA on AMBER apart from Qwen-VL and GPT-4V. The other benchmarks do not get worse, and SQA and MMB may even improve?
  • contribution : a cheap way to build DPO data.
  • etc. :

Details

  • why CLIP? Hallucinated captions are constructed as shown below, then the logits of CLIP and LLaVA 1.5 are compared. image
image

bar = how often the model assigned a larger logit to the hallucinated caption (dark blue: LLaVA 1.5 / light blue: CLIP)

CLIP picks out hallucinated objects, attributes, and relations better than the VLM!
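The quantity plotted as a bar can be sketched as a simple rate: over a set of (correct caption, hallucinated caption) pairs, how often does a scorer give the hallucinated one the higher score? This is a minimal sketch, not the paper's evaluation code. The function name `hallucination_preference_rate` and all scores below are made up for illustration.

```python
def hallucination_preference_rate(pairs):
    """Fraction of (correct_score, hallucinated_score) pairs in which the
    hallucinated caption received the strictly higher score.

    Lower is better: the scorer is fooled less often.
    """
    if not pairs:
        raise ValueError("pairs must not be empty")
    fooled = sum(1 for correct, hallucinated in pairs if hallucinated > correct)
    return fooled / len(pairs)


# Toy numbers only (assumption): not taken from the paper.
toy_clip_scores = [(31.0, 24.5), (28.2, 29.0), (30.1, 22.3), (27.5, 21.0)]
toy_vlm_logits = [(-1.2, -1.0), (-0.8, -1.5), (-2.0, -1.1), (-0.9, -0.7)]

clip_rate = hallucination_preference_rate(toy_clip_scores)
vlm_rate = hallucination_preference_rate(toy_vlm_logits)
print(f"CLIP fooled: {clip_rate:.2f}, VLM fooled: {vlm_rate:.2f}")
```

With these toy inputs the CLIP-style scorer is fooled on one pair out of four and the VLM-style scorer on three, which mirrors the direction of the claim above without reproducing its numbers.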

  • CLIP-DPO : the DPO algorithm itself is unchanged; only the data pool differs. image
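Since the objective is the standard DPO loss, a per-pair version can be sketched in plain Python for reference. This is a minimal sketch of the generic DPO formula, not code from the paper. The inputs are summed log-probabilities of the win (chosen) and lose (rejected) responses under the policy and the frozen reference model, and `beta=0.1` is only a placeholder value (assumption).

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair.

    loss = -log(sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))))
    where each term is a sequence log-probability.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
```

The loss drops below log(2) once the policy raises the win response relative to the reference more than it raises the lose response.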

  • data image

  1. generation : built in two forms using lightweight VLMs (the MobileVLM-v2 family in the paper)
  • generic caption : the MobileVLM-v2 models are asked to write captions, using 5 prompts

  • per-image QA image

Mistral 7B is asked to produce a question, a correct answer, and an incorrect answer for the image

  2. data annotation
  • CLIP ranking : compute CLIPScore for every sample

  • Global filtering :

    • remove images that contain text, since they get inflated CLIPScores
    • remove samples below a CLIPScore threshold
    • remove long captions
    • score questions with CLIPScore as well and remove the low-scoring ones (e.g. "what is the main object in the image?") image
  • Pair filtering :

    • for QA, strip the image description from Q with a regex, concatenate the rest with the answer, and filter out samples with low CLIPScore (?)
    • keep only pairs whose CLIPScore difference is at least 2
    • keep only pairs whose caption lengths are not too different

In total, 750K pairs are obtained: 50K are QA and the remaining 700K are captions
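The pair-filtering rules above can be sketched roughly as follows, assuming the CLIPScores have already been computed for every candidate response to one image. Only the score gap of 2 comes from these notes. The function name `build_pairs`, the word-count length measure, and the `max_len_ratio=1.5` cutoff are hypothetical choices for illustration, and the global filters are left out.

```python
from itertools import combinations


def build_pairs(candidates, min_gap=2.0, max_len_ratio=1.5):
    """Turn scored candidates for one image into (win, lose) text pairs.

    candidates: list of (text, clip_score) tuples.
    min_gap: minimum CLIPScore difference (2 in the notes).
    max_len_ratio: hypothetical cap on longer/shorter word-count ratio.
    """
    pairs = []
    for (text_a, score_a), (text_b, score_b) in combinations(candidates, 2):
        # Rule 1: the two scores must differ by at least min_gap.
        if abs(score_a - score_b) < min_gap:
            continue
        # Rule 2: the two texts must have comparable lengths (in words).
        len_a, len_b = len(text_a.split()), len(text_b.split())
        if max(len_a, len_b) > max_len_ratio * max(1, min(len_a, len_b)):
            continue
        # The higher-scoring text becomes the win (chosen) response.
        win, lose = (text_a, text_b) if score_a > score_b else (text_b, text_a)
        pairs.append((win, lose))
    return pairs


# Toy candidates with made-up scores (assumption): not data from the paper.
toy_candidates = [
    ("a dog lying on a red sofa", 30.5),
    ("a cat lying on a red sofa", 27.9),
    ("a dog on a sofa", 29.4),
    ("dog", 25.0),
]
toy_pairs = build_pairs(toy_candidates)
print(toy_pairs)
```

Here only the first two captions form a pair: the other combinations either differ by less than 2 in score or differ too much in length.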

image

Result

image image image