image

paper, page, code

TL;DR

  • I read this because… : related to my personal research
  • task : faithful T2I evaluation
  • problem : shortcomings of CLIPScore in evaluating whether a generated image faithfully matches its prompt
  • idea : Let’s solve it with VQA!
  • input/output : {image, text} -> score
  • architecture : GPT-3 + UnifiedQA + VQA (mPLUG-large, BLIP-2)
  • baseline : CLIPScore
  • evaluation : correlation with Likert-scale human preference
  • result : higher correlation with human judgments than CLIPScore

Details

motivation

image

TIFA overview

image

The metric is the fraction of questions the VQA model answers correctly on the generated image.
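A minimal sketch of this scoring rule; the `vqa` callable is a hypothetical stand-in for an actual VQA model such as mPLUG-large or BLIP-2:

```python
# TIFA score: fraction of generated-image VQA answers that match
# the expected answers from the question-generation step.
def tifa_score(image, qa_pairs, vqa):
    correct = sum(
        1 for question, expected in qa_pairs
        if vqa(image, question) == expected
    )
    return correct / len(qa_pairs)

# toy example with a dummy VQA model (answers "yes" to any "...dog?" question)
qa_pairs = [("is there a dog?", "yes"), ("what color is the dog?", "brown")]
dummy_vqa = lambda image, q: "yes" if "dog?" in q else "black"
print(tifa_score("img.png", qa_pairs, dummy_vqa))  # 0.5
```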

  • GPT-3 prompt image

TIFA detailed pipeline

image

Same as #182, but everything runs through GPT-3; a LLaMA model is also fine-tuned so that question generation is deterministic.

Question filtering is done with UnifiedQA.
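As I understand the filtering step: a GPT-3-generated QA pair is kept only if UnifiedQA, reading the original prompt text, reproduces the same answer. A sketch, where `unified_qa` is a hypothetical stand-in for the real UnifiedQA model:

```python
# Verify GPT-3's QA pairs against the prompt: drop any pair where
# UnifiedQA's answer (from the prompt text alone) disagrees with
# the answer GPT-3 proposed.
def filter_questions(prompt, qa_pairs, unified_qa):
    return [
        (q, a) for q, a in qa_pairs
        if unified_qa(question=q, context=prompt) == a
    ]

# toy example: a dummy model that answers "yes" to everything
pairs = [("is the cat black?", "yes"), ("how many cats?", "2")]
always_yes = lambda question, context: "yes"
print(filter_questions("a black cat", pairs, always_yes))  # keeps only the first pair
```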

TIFA v1.0 benchmark

image image
  • Likert score guideline: image
image
  • correlation with human preference: image