image paper

Collect VQA-X, a dataset that explains why each VQA question is answered the way it is.

image

The MPII Human Pose (MHP) dataset, shown on the right, is about the pose of a person in a photo. Because the pose also depends heavily on the surrounding objects and people, the authors collected ACT-X, which adds a one-line description. (cf. CLEVR-X was also added more recently.)

image

Additionally, the labels that annotators grounded in the image can be used as ground truth for pointing.
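As a rough illustration of how grounded labels could serve as pointing ground truth, the sketch below checks whether a model's strongest attention cell falls inside an annotated region. This is a simple "pointing game" style check chosen for illustration, not necessarily the evaluation used in the paper; the function name `pointing_hit` and the tiny 2x2 grids are hypothetical.

```python
def pointing_hit(attention, mask):
    """Return True if the highest-attention cell lies inside the ground-truth mask.

    attention: 2D list of floats (hypothetical model attention map)
    mask:      2D list of 0/1 values (hypothetical human-annotated region)
    """
    # Enumerate all (row, col) cells and pick the one with the largest attention.
    cells = ((r, c) for r in range(len(attention)) for c in range(len(attention[0])))
    best_row, best_col = max(cells, key=lambda rc: attention[rc[0]][rc[1]])
    return bool(mask[best_row][best_col])


# Toy example: the attention peak is at (1, 0), which the annotator also marked.
attention = [[0.1, 0.2],
             [0.6, 0.1]]
mask = [[0, 0],
        [1, 0]]
hit = pointing_hit(attention, mask)
print(hit)
```

A single hit-or-miss check is coarse. Averaging it over a dataset, or switching to an overlap-based score, would be natural next steps under the same assumptions.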

Propose a Pointing and Justification Explanation (PJ-X) model that answers questions about the images in these datasets and explains its answers.

image

results image

image

idea

  • These descriptions could also be used the other way around, as the explanations in few-shot examples.
  • What if we built a dataset like this for DocVQA? Q: “Price of a mow?” A: “500 won” X: “Because they are in the same row”
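The two ideas above can be sketched together: store each example as a question/answer/explanation triple, then reuse the triples as few-shot demonstrations. This is only a minimal sketch. The `ExplainedQA` record, the `build_few_shot_prompt` helper, the Q/A/X prompt layout, and the query "Price of an apple?" are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ExplainedQA:
    """One hypothetical DocVQA-style record with a textual explanation (X)."""
    question: str
    answer: str
    explanation: str


def build_few_shot_prompt(examples, query):
    """Format explained examples as demonstrations, then append the new question."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.answer}\nX: {ex.explanation}"
        for ex in examples
    ]
    # The final block leaves the answer open for the model to complete.
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)


examples = [
    ExplainedQA(
        question="Price of a mow?",
        answer="500 won",
        explanation="Because they are in the same row",
    ),
]
prompt = build_few_shot_prompt(examples, "Price of an apple?")
print(prompt)
```

Keeping the explanation as its own field means the same records could be used either as training targets for an explanation model or as demonstrations in a prompt.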

related papers