image

paper

TL;DR

  • I read this because.. : I wanted to see how the data was built and how the evaluation works
  • task : proposed. (1) FOIL Detection (2) FOIL word detection (3) FOIL word correction
  • problem : do VLMs such as captioning and VQA models really understand both modalities well?
  • idea : replace one word in the caption with a similar but different word
  • input/output : {image, caption} -> (1) whether the caption is a FOIL or not (2) where the FOIL word is (3) FOIL word correction
  • objective : ce loss
  • baseline : then-SOTA VQA and captioning models / a caption-only LSTM and a CNN+LSTM
  • data : built from COCO captions. 65K (train) / 32K (test) images and 197K (train) / 99K (test) captions.
  • evaluation : (1) accuracy (2) among FOIL captions, whether the foil word is found, evaluated once over nouns only and once over all content words (3) given the FOIL word, whether it is changed back to the original word
  • contribution : later used as a hallucination measure, among other things
  • etc. :
    • Built in the most reasonable way that was possible in 2017
    • Does not seem to be a very well-known evaluation set -> using a recent LVLM benchmark might be the better choice
      • Replacing only a single noun may be somewhat of a weakness

Details

Task

image

num samples

image

๋ฐ์ดํ„ฐ ์ œ์ž‘ ๋ฐฉ์‹

image
  1. Pair up MS-COCO objects that share the same supercategory
  • Categories made of two or more words are excluded, e.g. traffic light
  2. Split the pairs into train and test
  • A target::foil pair used for training is not used in test
  3. Create the foil captions
  • A word that appears in the caption is replaced
  • The replacement must be an object that is not present in the image
  • e.g. in "A dog and a cat are eating", a cat is present, so dog is not replaced with cat
  4. Use the NeuralTalk captioning model to select the most difficult foil caption
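A minimal sketch of steps 1, 3 and 4 above, not the authors' code. The `SUPERCATEGORIES` table is a hypothetical excerpt, and `plausibility` is a stand-in for a NeuralTalk-style scorer. Reading "most difficult" as "most plausible to the captioner" is an assumption. The train/test split of target::foil pairs (step 2) is left out.

```python
from typing import Callable, Dict, List, Optional, Set

# Hypothetical excerpt of MS-COCO supercategories; multi-word categories
# such as "traffic light" are left out, as in step 1.
SUPERCATEGORIES: Dict[str, List[str]] = {
    "animal": ["dog", "cat", "horse", "sheep"],
    "vehicle": ["car", "bus", "truck", "bicycle"],
}


def foil_candidates(target: str, objects_in_image: Set[str]) -> List[str]:
    """Same-supercategory words that are not the target and not in the image."""
    for members in SUPERCATEGORIES.values():
        if target in members:
            return [w for w in members
                    if w != target and w not in objects_in_image]
    return []


def make_foil_captions(caption: str, objects_in_image: Set[str]) -> List[str]:
    """Build foil captions, each differing from the original in one object word."""
    tokens = caption.split()
    foils = []
    for i, token in enumerate(tokens):
        for foil in foil_candidates(token, objects_in_image):
            foils.append(" ".join(tokens[:i] + [foil] + tokens[i + 1:]))
    return foils


def hardest_foil(foils: List[str],
                 plausibility: Callable[[str], float]) -> Optional[str]:
    """Keep the foil the scorer rates highest (assumed meaning of 'hardest')."""
    return max(foils, key=plausibility) if foils else None


# Both "dog" and "cat" are in the image, so neither is swapped for the other.
foils = make_foil_captions("a dog and a cat are eating", {"dog", "cat"})
```

In a real pipeline `plausibility` would wrap the captioning model's score for the image and caption pair; here any function from caption to float works.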

Evaluation

  • T1 is plain classification
  • T2: given {image, FOIL caption}, whether the model finds the foil word
  • T3: given {image, FOIL caption, FOIL word}, whether the model corrects the foil word
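The three scores can be sketched as plain accuracies. The `Example` and `Prediction` records are hypothetical containers, not a format from the paper, and simple unweighted accuracy is an assumption. T2 and T3 are computed over FOIL captions only, as described above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Example:
    is_foil: bool
    foil_word: Optional[str]    # None for original captions
    target_word: Optional[str]  # the original word that was replaced


@dataclass
class Prediction:
    is_foil: bool               # T1 output
    foil_word: Optional[str]    # T2 output
    correction: Optional[str]   # T3 output (the gold foil word is given as input)


def evaluate(examples: List[Example], preds: List[Prediction]) -> Dict[str, float]:
    """Accuracy for T1 over all captions, for T2 and T3 over FOIL captions only."""
    pairs = list(zip(examples, preds))
    foil_pairs = [(e, p) for e, p in pairs if e.is_foil]
    return {
        "T1": sum(p.is_foil == e.is_foil for e, p in pairs) / len(pairs),
        "T2": sum(p.foil_word == e.foil_word for e, p in foil_pairs) / len(foil_pairs),
        "T3": sum(p.correction == e.target_word for e, p in foil_pairs) / len(foil_pairs),
    }


examples = [
    Example(True, "cat", "dog"),
    Example(False, None, None),
    Example(True, "bus", "car"),
]
preds = [
    Prediction(True, "cat", "dog"),
    Prediction(True, "sofa", None),
    Prediction(True, "road", "car"),
]
result = evaluate(examples, preds)
```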
image

For T1 with a captioning model, each word of the given caption is removed in turn and the captioner is asked to generate a word for that position. The caption with the generated word substituted in is then compared with the given caption using the model's predicted score; if the substituted caption scores higher, the given caption is judged to be a FOIL.
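A minimal sketch of this decision rule, assuming a captioner that exposes two hypothetical hooks: `predict_word` (best word for a blanked position) and `score` (plausibility of a full caption for the image). The toy functions below only stand in for a real model on an image showing a dog.

```python
from typing import Callable, List

PredictWord = Callable[[List[str], int], str]
Score = Callable[[List[str]], float]


def is_foil(tokens: List[str], predict_word: PredictWord, score: Score) -> bool:
    """Flag the caption as FOIL if any single-word substitution scores higher."""
    base = score(tokens)
    for i in range(len(tokens)):
        best = predict_word(tokens, i)
        if best == tokens[i]:
            continue  # the captioner agrees with the given word
        candidate = tokens[:i] + [best] + tokens[i + 1:]
        if score(candidate) > base:
            return True
    return False


# Toy stand-ins (assumptions): the image shows a dog on a sofa.
ANIMALS = {"dog", "cat", "horse"}


def toy_predict_word(tokens: List[str], i: int) -> str:
    return "dog" if tokens[i] in ANIMALS else tokens[i]


def toy_score(tokens: List[str]) -> float:
    return 1.0 if "dog" in tokens else 0.1


foil_flag = is_foil("a cat on a sofa".split(), toy_predict_word, toy_score)
orig_flag = is_foil("a dog on a sofa".split(), toy_predict_word, toy_score)
```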

image

For T2, the occlusion method from Towards Transparent AI Systems: Interpreting Visual Question Answering Models (https://arxiv.org/pdf/1608.08974.pdf ) is used. That is, the words of the question are masked one at a time, a forward pass is run, and the importance of each word is measured by how much the score of the originally predicted answer changes. image
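A minimal sketch of occlusion applied to foil word detection. `foil_score` is a hypothetical stand-in for the model's score of the "FOIL" prediction, and `<unk>` as the mask token is an assumption. The word whose masking lowers that score the most is returned as the foil word.

```python
from typing import Callable, List

FoilScore = Callable[[List[str]], float]


def occlusion_drops(tokens: List[str], foil_score: FoilScore,
                    mask: str = "<unk>") -> List[float]:
    """Drop in the FOIL score when each word is masked, one at a time."""
    base = foil_score(tokens)
    return [base - foil_score(tokens[:i] + [mask] + tokens[i + 1:])
            for i in range(len(tokens))]


def detect_foil_word(tokens: List[str], foil_score: FoilScore) -> str:
    """Return the word whose occlusion changes the FOIL score the most."""
    drops = occlusion_drops(tokens, foil_score)
    idx = max(range(len(tokens)), key=drops.__getitem__)
    return tokens[idx]


# Toy stand-in (assumption): the image has no cat, so "cat" drives the FOIL score.
def toy_foil_score(tokens: List[str]) -> float:
    return 0.9 if "cat" in tokens else 0.2


detected = detect_foil_word("a cat on a sofa".split(), toy_foil_score)
```

To evaluate over nouns only, the candidate positions would be restricted to nouns before taking the maximum.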

image

For T3, a linear regression over the target words is performed (this seems to be the only part that is newly trained?)

Analysis

image

Examples of incorrectly constructed data image