TL;DR
- I read this because.. : Related to my personal research. Followed up from DSG
- task : image / text alignment with score!
- problem : Existing methods for measuring alignment do not come with an explanation.
- idea : Create data and benchmarks
- input/output : {image, text} -> score, misaligned text span, misaligned visual span, feedback
- objective : zero-shot (zs) or CE loss
- baseline : PaLI, mPLUG-Owl, MiniGPT-v2, LLaVA-1.5, fine-tuned PaLI
- data : proposed TV-Feedback // the test set is human-refined via AMT.
- evaluation : binary accuracy (is the pair aligned or not), text span (precision, not sure if it’s an exact match or not), feedback via NLI (using a BART-NLI model), bounding box at IoU 0.75
- Result : Fine-tuned PaLI performed best and also works well on the OOD dataset
- contribution : The research I wanted! Dataset released!
- etc. :
Details
Image source
Proposed ConGen
- spaCy is used to pick POS tags. Candidates are divided into 4 categories: object (noun), attribute (adjective), action (verb), and spatial relations.
- PaLM 2 is prompted to (a) create a contradiction caption, (b) write feedback detailing why it is a contradiction, (c) pinpoint which element within the caption is incorrect, and (d) produce a textual label for the visual bounding box.
- To distinguish if the generated contradiction caption is indeed different from the original caption, we use the Textual Entailment model to determine whether the
- Use GroundingDINO to ground the textual label produced by PaLM 2 to a bounding box in the image. We’ll call this set the Textual and Visual Feedback (TV-Feedback) data.
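The first ConGen step (bucketing caption words by POS into the four categories) can be sketched roughly as below. This is only an illustration: the tag names follow the Universal POS scheme that spaCy exposes as `token.pos_`, the mapping of adpositions (`ADP`) to spatial relations is my assumption, and `group_by_category` is a hypothetical helper, not something from the paper. The example caption is tagged by hand so the snippet runs without a tagger installed.

```python
from collections import defaultdict

# Assumed mapping from Universal POS tags to the four categories in these notes.
# Treating ADP (adpositions such as "under", "on") as spatial relations is an assumption.
POS_TO_CATEGORY = {
    "NOUN": "object",
    "ADJ": "attribute",
    "VERB": "action",
    "ADP": "spatial relation",
}


def group_by_category(tagged_tokens):
    """Group (token, pos) pairs into misalignment-candidate categories."""
    groups = defaultdict(list)
    for token, pos in tagged_tokens:
        category = POS_TO_CATEGORY.get(pos)
        if category is not None:  # skip determiners, punctuation, etc.
            groups[category].append(token)
    return dict(groups)


# Hand-tagged caption; in practice the tags would come from a POS tagger such as spaCy.
caption = [
    ("A", "DET"), ("brown", "ADJ"), ("dog", "NOUN"), ("sleeps", "VERB"),
    ("under", "ADP"), ("a", "DET"), ("table", "NOUN"),
]
candidates = group_by_category(caption)
print(candidates)
```

Each bucket would then give the LLM a word to perturb when writing the contradiction caption, for example swapping the attribute "brown" for another color.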
SeeTRUE-Feedback benchmark
Based on the SeeTRUE dataset, 2,008 samples were generated in a similar fashion to ConGen above, sent to AMT, and human-reviewed.
Evaluation metrics
- Image-text Alignment : binary accuracy
- Textual Feedback Quality: BART NLI with the ground truth as the premise and the prediction as the hypothesis
- Misalignment in text: Use BART NLI to check if text segments are aligned (similar to above, right?)
- Visual Misalignment Detection: measured with F1-Score@0.75

Alignment is evaluated on the 8100 SeeTRUE dataset, and the other metrics are evaluated on SeeTRUE-Feedback.
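To make the bounding-box metric concrete, here is a minimal sketch of an F1 score at an IoU threshold of 0.75. The notes do not spell out how predictions are matched to ground truth, so the greedy one-to-one matching below is an assumption, and `iou` and `f1_at_iou` are hypothetical helper names. Boxes are assumed to be `(x1, y1, x2, y2)` tuples.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(pred_boxes, gt_boxes, threshold=0.75):
    """F1 where a prediction is a true positive if it overlaps an unmatched
    ground-truth box with IoU >= threshold (greedy matching, an assumption)."""
    unmatched = list(gt_boxes)
    tp = 0
    for pred in pred_boxes:
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= threshold:
            tp += 1
            unmatched.remove(best)  # each ground-truth box is matched at most once
    fp = len(pred_boxes) - tp
    fn = len(gt_boxes) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


gt = [(0, 0, 10, 10)]
print(iou((0, 0, 10, 8), gt[0]))         # tight prediction
print(f1_at_iou([(0, 0, 10, 8)], gt))    # counts as a hit at 0.75
print(f1_at_iou([(5, 5, 15, 15)], gt))   # loose prediction, counts as a miss
```

A threshold this high means a box that only roughly covers the region is scored as a miss, which is worth keeping in mind when reading the "bounding box is too loose" limitation.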
Result
Ask the latest VLM models the following query
limitation of model prediction
- Difficult to give visual feedback when the misaligned element is absent from the image
- Difficult to give feedback when there are multiple misalignments
- Bounding box is too loose