TL;DR
- I read this because… : related to my personal research
- task : faithful T2I evaluation
- problem : shortcomings of CLIPScore for evaluating whether a generated image matches its prompt
- idea : Let’s solve it with VQA!
- input/output : {image, text} -> score
- architecture : GPT-3 + UnifiedQA + VQA (mPLUG-large, BLIP-2)
- baseline : CLIPScore
- evaluation : correlation with Likert-scaled human preference
- result : higher correlation
Details
motivation
TIFA overview
metric: the fraction of generated questions the VQA model answers correctly
- GPT-3 prompt
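The "fraction of correct VQA answers" metric above can be sketched as a few lines of Python; the QA pairs and the `vqa_answers` list here are illustrative placeholders, not the paper's actual data format.

```python
def tifa_score(expected_answers, vqa_answers):
    """TIFA-style faithfulness score: fraction of questions the VQA
    model answers in agreement with the expected answers."""
    assert expected_answers and len(expected_answers) == len(vqa_answers)
    correct = sum(e == v for e, v in zip(expected_answers, vqa_answers))
    return correct / len(expected_answers)
```

A prompt like "a red cube on a dog" might yield questions whose expected answers are `["red", "cube", "dog"]`; if the VQA model gets two of three right, the image scores 2/3.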
TIFA detailed pipeline
Same as #182, but everything is done with GPT-3. LLaMA-3 is also fine-tuned so the question-generation step is deterministic.
Question filtering is done with UnifiedQA
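The filtering step can be sketched as: keep a question only if UnifiedQA, reading the prompt text alone, reproduces the expected answer. The `unifiedqa_answer` callable below is a hypothetical helper standing in for an actual UnifiedQA inference call.

```python
def filter_questions(prompt_text, qa_pairs, unifiedqa_answer):
    """Keep only QA pairs that UnifiedQA can verify from the prompt.

    qa_pairs: list of (question, expected_answer) tuples.
    unifiedqa_answer: callable (question, context) -> answer string;
    placeholder for a real UnifiedQA model call.
    """
    kept = []
    for question, expected in qa_pairs:
        predicted = unifiedqa_answer(question, prompt_text)
        # Discard questions the QA model cannot answer consistently
        if predicted.strip().lower() == expected.strip().lower():
            kept.append((question, expected))
    return kept
```

This removes malformed or unanswerable GPT-3-generated questions before they ever reach the VQA model.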
TIFA v1.0 benchmark
- Likert Score guideline
- correlation with human preference