
paper, page, dataset

TL;DR

  • I read this because : related to my personal research; came via DSG
  • task : image / text alignment with score!
  • problem : existing methods measure alignment as a bare score, with no explanation of the misalignment
  • idea : create data and a benchmark
  • input/output : {image, text} -> score, misaligned text span, misaligned visual span, feedback
  • objective : zero-shot (zs) or cross-entropy (CE) loss
  • baseline : PaLI, mPLUG-Owl, MiniGPT-v2, LLaVA-1.5, fine-tuned PaLI
  • data : proposed TV Feedback // test set is AMT-collected with human refinement
  • evaluation : binary accuracy (is the pair aligned), text span precision (unclear whether an exact match is required), feedback via NLI (BART-NLI model), bounding boxes via IoU@0.75
  • Result : confirmed that the fine-tuned PaLI performed best and also works well on the OOD dataset
  • contribution : The research I wanted! Dataset released!
  • etc. :

Details


Proposed ConGen

    1. Use spaCy to pick POS-tagged candidate words, divided into 4 categories: object (noun), attribute (adjective), action (verb), and spatial relation.
    2. Prompt PaLM 2 to (a) create a contradiction caption, (b) write feedback detailing why it is a contradiction, (c) pinpoint which element within the caption is incorrect, and (d) describe the misaligned visual region as a bounding-box label.
    3. To verify that the generated contradiction caption actually differs from the original, run a textual-entailment model and keep only captions that are not entailed by the original.
    4. Use GroundingDINO to ground the textual label into a bounding box, matching the region described by PaLM 2. Call this set the Textual-Visual (TV) Feedback data.
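A minimal sketch of steps 1 and 3 above. This is my reconstruction, not the paper's code: the hand-written POS-to-category map stands in for spaCy's tagger (mapping prepositions to spatial relations is my assumption), and `contradicts` is a toy stand-in for the real textual-entailment model.

```python
# Sketch of ConGen candidate selection (step 1) and entailment filtering (step 3).
# Real pipeline: spaCy for POS tagging, an NLI model for filtering.

# Map POS tags to the four misalignment categories from the paper.
# ADP -> spatial relation is an assumption for this sketch.
POS_TO_CATEGORY = {
    "NOUN": "object",
    "ADJ": "attribute",
    "VERB": "action",
    "ADP": "spatial_relation",
}

def pick_candidates(tagged_caption):
    """tagged_caption: list of (token, POS-tag) pairs, e.g. from spaCy."""
    return [
        (token, POS_TO_CATEGORY[pos])
        for token, pos in tagged_caption
        if pos in POS_TO_CATEGORY
    ]

def contradicts(original, generated):
    """Toy stand-in for the entailment filter: keep the generated caption
    only if it differs from the original (bag-of-words heuristic)."""
    return set(generated.split()) != set(original.split())

tagged = [("a", "DET"), ("red", "ADJ"), ("dog", "NOUN"),
          ("sleeps", "VERB"), ("on", "ADP"), ("grass", "NOUN")]
print(pick_candidates(tagged))
```

Each candidate word would then be handed to PaLM 2 (step 2) to produce the contradiction caption and feedback for that category.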

SeeTrue-Feedback benchmark

Based on the SeeTrue dataset, 2,008 samples were generated in a similar fashion to ConGen above, then run through AMT and human-reviewed.


Evaluation metrics

  • Image-text Alignment : binary accuracy
  • Textual Feedback Quality: BART-NLI, with the ground-truth feedback as premise and the prediction as hypothesis
  • Misalignment in text: use BART-NLI to check whether the predicted text span matches the ground truth (same setup as above, presumably)
  • Visual Misalignment Detection: F1-score at IoU 0.75

Alignment is evaluated on the 8,100-sample SeeTRUE dataset; the remaining metrics are evaluated on SeeTrue-Feedback.
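The paper's exact box-matching protocol isn't spelled out in my notes; here is a minimal sketch of IoU and an F1 at threshold 0.75, where the greedy one-to-one matching between predicted and ground-truth boxes is my assumption:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    def area(b):
        return (b[2] - b[0]) * (b[3] - b[1])

    union = area(box_a) + area(box_b) - inter
    return inter / union if union else 0.0

def f1_at_iou(preds, gts, thresh=0.75):
    """A prediction counts as a true positive if it overlaps a
    still-unmatched ground-truth box with IoU >= thresh (greedy matching)."""
    matched = set()
    tp = 0
    for p in preds:
        for i, g in enumerate(gts):
            if i not in matched and iou(p, g) >= thresh:
                matched.add(i)
                tp += 1
                break
    prec = tp / len(preds) if preds else 0.0
    rec = tp / len(gts) if gts else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

With a 0.75 threshold, a predicted box must cover the ground-truth region quite tightly, which is why the "bounding box is too loose" failure mode noted below hurts this metric.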

Result

Query the latest VLMs on a benchmark image and compare their feedback.


Limitations of model predictions

  • Hard to give visual feedback when the misaligned element is absent from the image (nothing to draw a box around)
  • Hard to give feedback when there are multiple misalignments
  • Predicted bounding boxes are too loose