[154] Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment

TL;DR

I read this because.. : Related to my own research. Came across it via DSG.
task : image / text alignment with score!
problem : Existing methods for measuring alignment do not provide an explanation along with the score.
idea : Build a dataset and a benchmark.
input/output : {image, text} -> score, misaligned text span, misaligned visual span, feedback
objective : zero-shot (zs) or CE loss
baseline : PaLI, mPLUG-Owl, MiniGPT-v2, LLaVA-1.5, finetuned PaLI
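data : proposed TV Feedback // the test set was additionally cleaned by humans via AMT.
evaluation : binary accuracy (whether the pair is correctly aligned), text span (precision; not sure whether this is exact match or something else), feedback correctness via NLI (using a BART-NLI model), IoU .75
result : The finetuned PaLI performed best, and it also worked well on OOD datasets.
contribution : Exactly the research I wanted! The dataset is released!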
etc. :

1. POS tags are extracted with spaCy. Candidates are split into four categories: object (noun), attribute (adjective), action (verb), and spatial relations.
1. PaLM 2 is used to (a) generate a contradiction caption, (b) generate a detailed caption explaining why it is a contradiction, (c) pinpoint which element in the caption is wrong, and (d) extract a visual bounding box.
1. A Textual Entailment model is used to check whether the generated contradiction caption really differs from the original caption.
1. GroundingDINO is used to obtain the bounding box, together with its textual label, for the region PaLM 2 pointed to. The resulting set is called the Textual Visual Feedback data.
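
To keep the four steps above straight, here is a minimal sketch of the POS-to-category mapping, a record for one generated example, and the entailment filter. The record fields, the function names, and the `entails` callable are hypothetical. Mapping `ADP` to spatial relations and the direction of the entailment check are my assumptions, not something these notes confirm. The sketch takes already-tagged `(word, pos)` pairs (as spaCy's `token.pos_` would give) so it runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Universal POS tag -> category used in the notes.
# NOUN/ADJ/VERB follow the notes; ADP -> spatial relation is an assumption.
POS_TO_CATEGORY = {
    "NOUN": "object",
    "ADJ": "attribute",
    "VERB": "action",
    "ADP": "spatial relation",  # assumption: prepositions carry spatial relations
}

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class FeedbackExample:
    """One generated example (field names are hypothetical)."""
    caption: str                # original, aligned caption
    contradiction: str          # step (a): contradicting caption
    feedback: str               # step (b): why it contradicts the image
    misaligned_text: str        # step (c): the wrong element in the caption
    box_label: str              # step (d): textual label of the visual region
    box: Optional[Box] = None   # filled in later by a grounding model


def candidate_words(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only words whose POS tag maps to one of the four categories."""
    return [(word, POS_TO_CATEGORY[pos]) for word, pos in tagged
            if pos in POS_TO_CATEGORY]


def keep_example(ex: FeedbackExample,
                 entails: Callable[[str, str], bool]) -> bool:
    """Drop examples whose contradiction is still entailed by the caption.

    `entails(premise, hypothesis)` stands in for a textual entailment model.
    """
    return not entails(ex.caption, ex.contradiction)
```

In practice `entails` would wrap an NLI model, and `box` would be filled from the grounding model's output for `box_label`.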

Based on the SeeTRUE dataset: samples are generated in a way similar to ConGen above and then sent to AMT, where humans verified 2,008 samples.

Image-text Alignment : binary accuracy
Textual Feedback Quality : BART NLI, with the gt as premise and the prediction as hypothesis
Misalignment in text : BART NLI is used to check whether the text segment is correct (presumably similar to the approach above?)
Visual Misalignment Detection: measured with F1-Score@0.75

Alignment is evaluated on the SeeTRUE dataset (8,100 samples); the other metrics are evaluated on SeeTRUE-Feedback.
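
For the visual metric, a minimal sketch of how F1-Score@0.75 could be computed from predicted and ground-truth boxes. The greedy one-to-one matching and the `(x_min, y_min, x_max, y_max)` box format are my assumptions; these notes do not say how the paper matches boxes.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(preds: Sequence[Box], gts: Sequence[Box],
              threshold: float = 0.75) -> float:
    """F1 where a prediction is a true positive if it overlaps an
    unmatched ground-truth box with IoU >= threshold (greedy matching)."""
    matched = set()
    tp = 0
    for p in preds:
        best_j, best_score = -1, 0.0
        for j, g in enumerate(gts):
            if j in matched:
                continue
            score = iou(p, g)
            if score > best_score:
                best_j, best_score = j, score
        if best_j >= 0 and best_score >= threshold:
            matched.add(best_j)
            tp += 1
    if tp == 0:
        return 0.0
    precision = tp / len(preds)
    recall = tp / len(gts)
    return 2 * precision * recall / (precision + recall)
```

With one correct box and one spurious box against a single ground-truth box, precision is 0.5 and recall is 1.0, so the F1 comes out at 2/3.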

Recent VLMs are queried as shown below.