
paper, code

TL;DR

  • I read this because : I saw it on Facebook and wondered whether it could be applied to CLIP evaluation, so I read it.
  • task : evaluating the faithfulness of image generation
  • problem : CLIPScore is not consistent in scale, and its interpretability depends on the style. QG/QA-based methods are hard to interpret when a compound question ("is there a blue door?") gets a "no" — is there no door, or no blue door? — and the VQA model itself makes errors, e.g. answering that there is no door on one question while answering that there is a blue door on another.
  • idea : make each question atomic and link the questions in a graph, so that if a parent question is answered "no", all of its children are answered "no" as well.
  • input/output : image + text -> graph (nodes are questions, edges are their semantic dependencies)
  • baseline : QG/QA-based methods
  • data : DSG-1k, released with graphs built on top of prior evaluation data such as TIFA. It was created by 1) turning the text paired with each image into entity tuples via an LLM, 2) generating a question from each tuple, and 3) identifying the dependencies between the tuples.
  • evaluation : whether each question is answered correctly for each image
  • result : appears to solve the problems above. Among VLMs, PaLI performs best.
  • contribution : improved QG/QA-based evaluation so that fine-grained evaluation is more interpretable.
  • And I was wondering: if you asked a model like GPT-4V "does <description> describe <img> well? What is wrong?", what would come up?
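The dependency idea above can be sketched in a few lines. This is my own illustration, not the paper's code: the question ids, the `parent` field, and the propagation rule (a parent answered "no" forces its children to "no") are assumptions taken from the TL;DR, and the raw yes/no answers would come from whatever VQA model you plug in.

```python
def score_with_dependencies(questions, answers):
    """questions: dict id -> {"text": str, "parent": id or None}
    answers: dict id -> bool (raw per-question yes/no from a VQA model)
    Returns the propagated answers and the fraction answered "yes"."""
    final = {}

    def resolve(qid):
        if qid in final:
            return final[qid]
        parent = questions[qid]["parent"]
        # If any ancestor is answered "no", this question is also "no".
        ok = answers[qid] and (parent is None or resolve(parent))
        final[qid] = ok
        return ok

    for qid in questions:
        resolve(qid)
    return final, sum(final.values()) / len(final)

# Example from the note: "is there a door?" is the parent of "is the door blue?".
qs = {
    "q1": {"text": "is there a door?", "parent": None},
    "q2": {"text": "is the door blue?", "parent": "q1"},
}
# A raw VQA model might inconsistently say no-door but yes-blue-door;
# propagation overrides the child, so the score is 0.0 instead of 0.5.
final, score = score_with_dependencies(qs, {"q1": False, "q2": True})
```

The nice property is exactly the interpretability claim: when the score drops, you can point at the specific atomic question (and its subtree) that failed.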

Details

QG/QA-based methodology

(figure)

motivation

  • problem of CLIPScore (figure)

  • problem of the QG/QA method (figure)

Proposed

(figure)

Dataset source

(figures)

It’s a lot, but I don’t have time… Goodbye.