[154] Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment

TL;DR

I read this because.. : Related to my personal research. Came across it by following on from DSG.
task : image / text alignment with score!
problem : The existing methodology for measuring alignment does not come with an explanation.
idea : Create data and benchmarks
input/output : {image, text} -> score, misaligned text span, misaligned visual span, feedback
objective : zero-shot (zs), or fine-tuning with CE loss
baseline : PaLI, mPLUG-Owl, MiniGPT-v2, LLaVA-1.5, fine-tuned PaLI
data : the proposed TV Feedback set // the test set is refined by humans via AMT.
evaluation : binary accuracy (is the image-text pair aligned or not), text span (precision; not sure whether it requires an exact match), feedback via NLI (using a BART-NLI model), bounding boxes at IoU 0.75
result : The fine-tuned PaLI performed best and also works well on the OOD dataset
contribution : The research I wanted! Dataset released!
etc. :

1. POS tags are extracted with spaCy and divided into 4 categories: object (noun), attribute (adjective), action (verb), and spatial relations.
1. Using PaLM 2, prompt the model to (a) create a contradiction caption, (b) write feedback detailing why it is a contradiction, (c) pinpoint which element within the caption is incorrect, and (d) give a textual label for the visual bounding box.
1. To check that the generated contradiction caption is indeed different from the original caption, a textual entailment model is used to determine whether the
1. Use GroundingDINO to ground the textual label produced by PaLM 2 into a bounding box on the image. We'll call this set the Textual Visual Feedback data.
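
The first step of the pipeline above (sorting caption words into the four categories by POS tag) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `POS_TO_CATEGORY` table and the `group_by_category` helper are hypothetical names, and mapping `ADP` (prepositions) to spatial relations is my own assumption, since the notes only give the noun/adjective/verb mappings. The input is a list of `(word, pos)` pairs so the snippet runs without spaCy installed; with spaCy they could be obtained as `[(t.text, t.pos_) for t in nlp(caption)]`.

```python
from typing import Dict, List, Tuple

# Assumed mapping from Universal POS tags to the four categories.
# NOUN/ADJ/VERB follow the notes; PROPN and ADP are assumptions of this sketch.
POS_TO_CATEGORY: Dict[str, str] = {
    "NOUN": "object",
    "PROPN": "object",
    "ADJ": "attribute",
    "VERB": "action",
    "ADP": "spatial relation",
}

CATEGORIES = ("object", "attribute", "action", "spatial relation")


def group_by_category(tagged: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (word, pos) pairs into the four misalignment categories."""
    groups: Dict[str, List[str]] = {c: [] for c in CATEGORIES}
    for word, pos in tagged:
        category = POS_TO_CATEGORY.get(pos)
        if category is not None:  # skip determiners, punctuation, etc.
            groups[category].append(word)
    return groups


# Hand-tagged tokens for "A brown dog sits on the sofa".
tagged_caption = [
    ("A", "DET"), ("brown", "ADJ"), ("dog", "NOUN"), ("sits", "VERB"),
    ("on", "ADP"), ("the", "DET"), ("sofa", "NOUN"),
]
groups = group_by_category(tagged_caption)
print(groups)
```

Each non-empty category then gives a candidate element to perturb when prompting for a contradiction caption.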

Based on the SeeTRUE dataset, 2,008 samples were generated in a similar fashion to ConGen above, sent to AMT, and reviewed by humans.

Image-text Alignment : binary accuracy
Textual Feedback Quality: BART NLI, with the ground truth as premise and the prediction as hypothesis
Misalignment in Text: Use BART NLI to check whether the text segments are aligned (similar to the above, right?)
Visual Misalignment Detection: Measured with F1-Score@0.75

Alignment is evaluated on the SeeTRUE dataset (8,100 examples), and the other metrics on SeeTRUE-Feedback.
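
To make the F1-Score@0.75 metric concrete, here is a minimal sketch of how it could be computed for bounding boxes. The notes do not describe the exact matching procedure, so the greedy one-to-one matching below is an assumption, and `iou` and `f1_at_iou` are hypothetical helper names. Boxes are `(x_min, y_min, x_max, y_max)` tuples.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(preds: List[Box], gts: List[Box], threshold: float = 0.75) -> float:
    """F1 where a prediction counts as a hit if IoU >= threshold.

    Assumption: each ground-truth box can be matched at most once,
    and predictions are matched greedily in the order given.
    """
    unmatched = list(gts)
    true_positives = 0
    for pred in preds:
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= threshold:
            true_positives += 1
            unmatched.remove(best)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(preds)
    recall = true_positives / len(gts)
    return 2 * precision * recall / (precision + recall)


# One good prediction (IoU 0.9) plus one spurious box: precision 0.5, recall 1.0.
score = f1_at_iou([(0, 0, 10, 10), (50, 50, 60, 60)], [(0, 0, 10, 9)])
print(round(score, 3))  # 0.667
```

A threshold of 0.75 is fairly strict: a box shifted by half its width against an equally sized ground truth only reaches an IoU of 1/3 and would count as a miss.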

Ask the latest VLM models the following query