image

paper, code

TL;DR

  • I read this because.. : related to my own research
  • task : measure whether VLMs lean too heavily on either vision or language
  • problem : existing occlusion + accuracy based methods cannot accurately measure which modality a model relied on.
  • idea : instead of model accuracy, score how much each modality influenced the model's prediction
  • input/output : {image, text} -> score for each modality (positive, negative, neutral)
  • architecture : ALBEF, CLIP, LXMERT, 4 VQA models
  • baseline : task accuracy
  • data : VQA, GQA, Image-sentence alignment (VQA, GQA), VALSE, FOIL
  • evaluation : T-SHAP, V-SHAP
  • result : -
  • contribution :
  • etc. :

Details

motivation

image

CLIP fails to assign a negative score to the wrong word (keyboard).

SHAP

Apparently it is based on Shapley values from game theory.

image

image
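
For reference, a sketch of the standard Shapley value from game theory, plus the modality ratios as I understand the paper to define them. The symbols are my own notation, not copied from the figures above: $N$ is the set of all $n$ input tokens (text tokens and image patches), $v(S)$ is the model output when only the tokens in $S$ are left unmasked, and $\phi_j$ is the contribution of token $j$.

$$
\phi_j = \sum_{S \subseteq N \setminus \{j\}} \frac{|S|!\,(n - |S| - 1)!}{n!}\,\bigl(v(S \cup \{j\}) - v(S)\bigr)
$$

$$
\Phi_T = \sum_{j \in \text{text}} |\phi_j|, \qquad
\Phi_V = \sum_{j \in \text{image}} |\phi_j|, \qquad
\text{T-SHAP} = \frac{\Phi_T}{\Phi_T + \Phi_V}, \qquad
\text{V-SHAP} = \frac{\Phi_V}{\Phi_T + \Phi_V}
$$

Because the ratios use absolute values, a token that pushes the prediction down still counts as "used", while the sign of $\phi_j$ itself is what separates positive from negative contributions.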

Similar to occlusion-based methods, but instead of occluding one token at a time, it builds subsets that include combinations of tokens and occludes those. Since there are far too many combinations, they are subsampled.
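
The subset-plus-subsampling idea can be sketched as a Monte Carlo estimate over random token orderings. This is a minimal illustration, not the paper's implementation (which may use a different estimator): `toy_score`, `WEIGHTS`, `shapley_sampling` and `modality_shares` are all hypothetical names, and the toy scorer stands in for a real VLM.

```python
import random

# Hypothetical toy "model": 3 text tokens followed by 2 image patches.
# The third text token plays the role of a wrong word with a negative effect.
WEIGHTS = [0.5, 1.0, -1.5, 2.0, 1.0]
N_TEXT = 3


def toy_score(mask):
    """Score for a list of booleans (True = token visible, False = occluded)."""
    score = sum(w for w, keep in zip(WEIGHTS, mask) if keep)
    # Cross-modal interaction: text token 1 and image patch 3 reinforce each other.
    if mask[1] and mask[3]:
        score += 1.0
    return score


def shapley_sampling(predict, n_tokens, n_permutations=2000, seed=0):
    """Estimate Shapley values by averaging marginal gains over random orderings."""
    rng = random.Random(seed)
    phi = [0.0] * n_tokens
    for _ in range(n_permutations):
        order = list(range(n_tokens))
        rng.shuffle(order)
        mask = [False] * n_tokens      # start with everything occluded
        prev = predict(mask)
        for j in order:                # reveal tokens one by one
            mask[j] = True
            new = predict(mask)
            phi[j] += new - prev       # marginal contribution of token j
            prev = new
    return [p / n_permutations for p in phi]


def modality_shares(phi, n_text):
    """Return (text share, vision share) from absolute per-token contributions."""
    text = sum(abs(p) for p in phi[:n_text])
    vision = sum(abs(p) for p in phi[n_text:])
    total = text + vision
    if total == 0:
        raise ValueError("all contributions are zero; shares are undefined")
    return text / total, vision / total


phi = shapley_sampling(toy_score, len(WEIGHTS))
t_shap, v_shap = modality_shares(phi, N_TEXT)
print("per-token contributions:", [round(p, 3) for p in phi])
print("text share:", round(t_shap, 3), "vision share:", round(v_shap, 3))
```

Each random ordering visits one chain of nested subsets, so sampling orderings is one way to subsample the exponentially many token combinations. The negative token keeps a negative `phi` but still adds to the text share through its absolute value.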

why not attention based?

image

CheferCAM cannot capture negative contributions!

image