TL;DR
- I read this because… : related to my personal research
- task : faithful T2I evaluation
- problem : shortcomings of CLIPScore for evaluating whether a generated image matches its prompt
- idea : Let’s solve it with VQA!
- input/output : {image, text} -> score
- architecture : GPT-3 + UnifiedQA + VQA (mPLUG-large, BLIP-2)
- baseline : CLIPScore
- evaluation : correlation with Likert-scaled human preference
- result : higher correlation
Details
motivation
TIFA overview
metric: the fraction of generated questions the VQA model answers correctly
- GPT-3 prompt
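The "fraction of correct VQA answers" metric above can be sketched as a few lines of Python; the QA pairs and the `vqa_answers` list here are illustrative placeholders, not the paper's actual data format.

```python
def tifa_score(expected_answers, vqa_answers):
    """TIFA-style faithfulness score: fraction of questions the VQA
    model answers in agreement with the expected answers."""
    assert expected_answers and len(expected_answers) == len(vqa_answers)
    correct = sum(e == v for e, v in zip(expected_answers, vqa_answers))
    return correct / len(expected_answers)
```

A prompt like "a red cube on a dog" might yield questions whose expected answers are `["red", "cube", "dog"]`; if the VQA model gets two of three right, the image scores 2/3.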
TIFA detailed pipeline
Same as #182, but everything is done with GPT-3. LLaMA-3 is also fine-tuned so the question-generation step is deterministic.
Question filtering is done with UnifiedQA
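The filtering step can be sketched as: keep a question only if UnifiedQA, reading the prompt text alone, reproduces the expected answer. The `unifiedqa_answer` callable below is a hypothetical helper standing in for an actual UnifiedQA inference call.

```python
def filter_questions(prompt_text, qa_pairs, unifiedqa_answer):
    """Keep only QA pairs that UnifiedQA can verify from the prompt.

    qa_pairs: list of (question, expected_answer) tuples.
    unifiedqa_answer: callable (question, context) -> answer string;
    placeholder for a real UnifiedQA model call.
    """
    kept = []
    for question, expected in qa_pairs:
        predicted = unifiedqa_answer(question, prompt_text)
        # Discard questions the QA model cannot answer consistently
        if predicted.strip().lower() == expected.strip().lower():
            kept.append((question, expected))
    return kept
```

This removes malformed or unanswerable GPT-3-generated questions before they ever reach the VQA model.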
TIFA v1.0 benchmark
- Likert Score guideline
- correlation with human preference