Collect VQA-X, a dataset that adds textual explanations of why each VQA answer is given.
The MPII Human Pose (MHP) dataset is about a person's pose/activity in a photo, which also depends heavily on the surrounding objects and people, so they collected ACT-X with one-line textual explanations. (cf. CLEVR-X was also added recently)
Additionally, the regions that annotators ground in the image can be used as ground truth for the pointing explanations.
Propose a Pointing and Justification Explanation (PJ-X) model that answers the question and provides both a textual justification and a pointing (attention) explanation on these datasets.

results


idea
- Conversely, these explanations could also be used as few-shot demonstrations.
- What if we built a dataset like this for DocVQA? Q: “Price of a mow?” A: “500 won” X: “Because they are in the same row”
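To make the DocVQA idea concrete, a record in such a hypothetical "DocVQA-X" could look like the following (every field name and value here is invented for illustration, reusing the Q/A/X example from the note):

```python
# Hypothetical record format for a "DocVQA-X" style dataset: each DocVQA
# sample is extended with a textual explanation (X) and a grounded region.

record = {
    "question": "Price of a mow?",
    "answer": "500 won",
    "explanation": "Because they are in the same row",
    # Bounding box [x, y, w, h] of the evidence region in the document image.
    "evidence_box": [120, 340, 200, 24],
}

def is_valid(rec):
    # A usable record needs an answer, an explanation, and a grounded region.
    required = ("question", "answer", "explanation", "evidence_box")
    return all(k in rec for k in required) and len(rec["evidence_box"]) == 4

print(is_valid(record))  # → True
```

The `evidence_box` would play the same role as the pointing ground truth in VQA-X/ACT-X, letting the explanation be evaluated both textually and visually.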
related papers