
paper, code

TL;DR

  • I read this because : I saw it on Facebook and wondered whether it could be applied to CLIP evaluation, so I read it.
  • task : evaluating the faithfulness of image generation
  • problem : CLIPScore is not consistent in scale, and its interpretability depends on the style. QG/QA-based methods are hard to interpret when a compound question ("is there a blue door?") gets a "no" — is there no door, or no blue door? — and the VQA model itself makes errors, e.g. answering that there is no door on one question while answering that there is a blue door on another.
  • idea : make each question atomic and link the questions in a graph, so that if a parent question is answered "no", all of its children are answered "no" as well.
  • input/output : image + text -> graph (nodes are questions, edges are their semantic dependencies)
  • baseline : QG/QA-based methods
  • data : DSG-1k, released with graphs built on top of prior evaluation data such as TIFA. It was created by 1) turning the text paired with each image into entity tuples via an LLM, 2) generating a question from each tuple, and 3) identifying the dependencies between the tuples.
  • evaluation : whether each question is answered correctly for each image
  • result : appears to solve the problems above. Among VLMs, PaLI performs best.
  • contribution : improved QG/QA-based evaluation so that fine-grained evaluation is more interpretable.
  • And I was wondering: if you asked a model like GPT-4V "does <description> describe <img> well? What is wrong?", what would come up?
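The dependency idea above can be sketched in a few lines. This is my own illustration, not the paper's code: the question ids, the `parent` field, and the propagation rule (a parent answered "no" forces its children to "no") are assumptions taken from the TL;DR, and the raw yes/no answers would come from whatever VQA model you plug in.

```python
def score_with_dependencies(questions, answers):
    """questions: dict id -> {"text": str, "parent": id or None}
    answers: dict id -> bool (raw per-question yes/no from a VQA model)
    Returns the propagated answers and the fraction answered "yes"."""
    final = {}

    def resolve(qid):
        if qid in final:
            return final[qid]
        parent = questions[qid]["parent"]
        # If any ancestor is answered "no", this question is also "no".
        ok = answers[qid] and (parent is None or resolve(parent))
        final[qid] = ok
        return ok

    for qid in questions:
        resolve(qid)
    return final, sum(final.values()) / len(final)

# Example from the note: "is there a door?" is the parent of "is the door blue?".
qs = {
    "q1": {"text": "is there a door?", "parent": None},
    "q2": {"text": "is the door blue?", "parent": "q1"},
}
# A raw VQA model might inconsistently say no-door but yes-blue-door;
# propagation overrides the child, so the score is 0.0 instead of 0.5.
final, score = score_with_dependencies(qs, {"q1": False, "q2": True})
```

The nice property is exactly the interpretability claim: when the score drops, you can point at the specific atomic question (and its subtree) that failed.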

Details

QG/QA-based methodology

(figure)

motivation

  • problem of CLIPScore (figure)

  • problem of the QG/QA method (figure)

Proposed

(figure)

Dataset source

(figures)

It’s a lot, but I don’t have time… Goodbye.