image

paper, code

TL;DR

  • I read this because.. : Saw it on Facebook and wondered whether it could be applied to CLIP evaluation.
  • task : evaluating faithfulness of image generation
  • problem : CLIPScore's scale is not consistent across styles, and it is not interpretable. With QG/QA-based methods, when a compound question ("Is there a blue door?") is answered "no", it is hard to tell what is wrong (is there no door, or is there no blue door?). With multiple questions, the VQA model itself also makes errors, e.g. answering that there is no door and then answering that there is a blue door.
  • idea : Make each question atomic and connect the questions into a graph, so that if a parent is answered "no", all of its children are also treated as "no".
  • input/output : image + text -> graph (questions as nodes, semantic dependencies as edges)
  • baseline : QA/QG
  • data : Releases DSG-1k, with graphs built on top of earlier evaluation data such as TIFA. It is built by passing the text paired with each image through an LLM to 1) produce entity tuples, 2) generate questions from those tuples, and 3) derive the dependencies between the tuples.
  • evaluation : Are the questions for each image answered correctly?
  • result : Apparently claims to resolve the problems above. Among the VLMs, PaLI performs best.
  • contribution : Improves QG/A-based evaluation, which makes fine-grained evaluation more interpretable.
  • etc. : A bit different from what I expected, lol. One catch is that it needs separate QG / QA models? Maybe I should at least look at the dataset. Also, I suddenly got curious: what would models like GPT4-V output if asked “is <description> well explained <img>?, what is wrong?”
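
To make the "parent is no, so children are no" idea concrete, here is a minimal sketch. It is not the paper's code: the names `apply_dependencies`, `score`, and the question ids are made up for illustration, and it assumes the graph is acyclic and that the final score is simply the fraction of questions that end up "yes".

```python
from typing import Dict, List


def apply_dependencies(
    raw_answers: Dict[str, bool],
    parents: Dict[str, List[str]],
) -> Dict[str, bool]:
    """Force a question to "no" (False) whenever any ancestor is "no".

    raw_answers: question id -> raw yes/no answer from a VQA model.
    parents:     question id -> ids of the questions it depends on.
    Assumes the dependency graph has no cycles (hypothetical input format).
    """
    resolved: Dict[str, bool] = {}

    def resolve(qid: str) -> bool:
        if qid not in resolved:
            # "yes" only if the model said yes AND every parent resolves to yes.
            parents_ok = all(resolve(p) for p in parents.get(qid, []))
            resolved[qid] = raw_answers[qid] and parents_ok
        return resolved[qid]

    for qid in raw_answers:
        resolve(qid)
    return resolved


def score(raw_answers: Dict[str, bool], parents: Dict[str, List[str]]) -> float:
    """Fraction of questions that are "yes" after applying dependencies."""
    resolved = apply_dependencies(raw_answers, parents)
    return sum(resolved.values()) / len(resolved)


# The inconsistent case from the notes: the VQA model says there is no door,
# but also says the door is blue.
raw = {"is_there_a_door": False, "is_the_door_blue": True}
deps = {"is_the_door_blue": ["is_there_a_door"]}
print(apply_dependencies(raw, deps))
print(score(raw, deps))  # 0.0, versus 0.5 if the raw answers were averaged
```

The point of the design is that the contradictory "blue door without a door" answer no longer earns partial credit.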

Details

QA/G based methodology

image

motivation

  • problem of CLIPScore image

  • problem of QA/G method image

Proposed

image
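
A rough sketch of what the three-step output (tuples, questions, dependencies) could look like as data. Everything here is an assumption for illustration: the `SemanticTuple` class, the `kind` labels, and the question templates are hypothetical, and the hand-written tuples stand in for what an LLM would produce from the prompt.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SemanticTuple:
    tid: int                      # tuple id, also used as the question id
    kind: str                     # illustrative labels: entity / attribute / relation
    args: List[str]
    parents: List[int] = field(default_factory=list)  # ids this tuple depends on


def to_question(t: SemanticTuple) -> str:
    """Turn one atomic tuple into one yes/no question (template stand-in for an LLM)."""
    if t.kind == "entity":
        return f"Is there a {t.args[0]}?"
    if t.kind == "attribute":
        return f"Is the {t.args[0]} {t.args[1]}?"
    if t.kind == "relation":
        return f"Is the {t.args[0]} {t.args[1]} the {t.args[2]}?"
    raise ValueError(f"unknown tuple kind: {t.kind}")


# Hand-written tuples for the prompt "a blue door next to a window".
tuples = [
    SemanticTuple(1, "entity", ["door"]),
    SemanticTuple(2, "entity", ["window"]),
    SemanticTuple(3, "attribute", ["door", "blue"], parents=[1]),
    SemanticTuple(4, "relation", ["door", "next to", "window"], parents=[1, 2]),
]

# Nodes of the graph: one atomic question per tuple.
questions: Dict[int, str] = {t.tid: to_question(t) for t in tuples}
# Edges of the graph: child id -> parent ids.
dependencies: Dict[int, List[int]] = {t.tid: t.parents for t in tuples}

for tid, q in questions.items():
    print(tid, q, "depends on", dependencies[tid])
```

Each question covers exactly one tuple, so a "no" points at one specific thing (the door, its color, or its position) rather than at a compound statement.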

Dataset source

image image

There is a lot more, but I'm out of time .. so that's it for now..