Collect VQA-X, a dataset that adds textual explanations of why each VQA answer is given.
The MPII Human Pose (MHP) dataset is about a person's pose/activity in a photo, which also depends heavily on the surrounding objects and people, so they collected ACT-X with one-line textual explanations. (cf. CLEVR-X was also added recently)
Additionally, the regions that annotators ground in the image can be used as ground truth for the pointing explanations.
Propose a Pointing and Justification Explanation (PJ-X) model that answers the question and provides both a textual justification and a pointing (attention) explanation on these datasets.

results


idea
- Conversely, these explanations could also be used as few-shot demonstrations.
- What if we built a dataset like this for DocVQA? Q: “Price of a mow?” A: “500 won” X: “Because they are in the same row”
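To make the DocVQA idea concrete, a record in such a hypothetical "DocVQA-X" could look like the following (every field name and value here is invented for illustration, reusing the Q/A/X example from the note):

```python
# Hypothetical record format for a "DocVQA-X" style dataset: each DocVQA
# sample is extended with a textual explanation (X) and a grounded region.

record = {
    "question": "Price of a mow?",
    "answer": "500 won",
    "explanation": "Because they are in the same row",
    # Bounding box [x, y, w, h] of the evidence region in the document image.
    "evidence_box": [120, 340, 200, 24],
}

def is_valid(rec):
    # A usable record needs an answer, an explanation, and a grounded region.
    required = ("question", "answer", "explanation", "evidence_box")
    return all(k in rec for k in required) and len(rec["evidence_box"]) == 4

print(is_valid(record))  # → True
```

The `evidence_box` would play the same role as the pointing ground truth in VQA-X/ACT-X, letting the explanation be evaluated both textually and visually.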
related papers