[161] MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks

TL;DR

I read this because.. : related to my own research
task : measure whether VLMs lean too heavily on either vision or language
problem : existing occlusion + accuracy-based methods cannot accurately measure which modality a model relies on.
idea : score how much each input influenced the model's prediction, rather than scoring the model's accuracy
input/output : {image, text} -> score for each modality (positive, negative, neutral)
architecture : ALBEF, CLIP, LXMERT, 4 VQA models
baseline : task accuracy
data : VQA, GQA, image-sentence alignment (VQA, GQA), VALSE, FOIL
evaluation : T-SHAP, V-SHAP
result : -
contribution :
etc. :

CLIP fails to assign a negative score to the incorrect word (keyboard).

Apparently it is based on Shapley values from game theory.
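
For reference, the textbook Shapley value is written out below. The symbols are my own notation, not taken from the paper: $N$ is the set of input tokens (text tokens plus image patches), $n = |N|$, and $v(S)$ is assumed to be the model output when only the tokens in $S$ are left unmasked.

$$
\phi_j = \sum_{S \subseteq N \setminus \{j\}} \frac{|S|!\,(n - |S| - 1)!}{n!} \Bigl( v(S \cup \{j\}) - v(S) \Bigr)
$$

A positive $\phi_j$ means token $j$ pushes the prediction up on average, a negative one means it pushes it down, and a value near zero means it is neutral.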

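Similar to occlusion-based methods, but instead of occluding single tokens it also builds subsets (combinations) of tokens and occludes those. Because there are far too many combinations, the subsets are subsampled.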
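
A minimal sketch of the subsampled subset-occlusion idea, to make the note above concrete. It uses permutation sampling, which is one common way to subsample coalitions and not necessarily the sampler the paper uses. The names `sample_shapley`, `modality_shares`, and the toy `value_fn` are hypothetical, and the aggregation into T-SHAP / V-SHAP (share of absolute contribution per modality) reflects my understanding of the metric rather than the authors' code.

```python
import random
from typing import Callable, FrozenSet, List, Tuple


def sample_shapley(
    value_fn: Callable[[FrozenSet[int]], float],
    n_tokens: int,
    n_permutations: int = 200,
    seed: int = 0,
) -> List[float]:
    """Estimate per-token Shapley values by sampling random token orderings.

    value_fn(visible) is assumed to return the model score when only the
    token indices in `visible` are left unmasked (everything else occluded).
    """
    rng = random.Random(seed)
    phi = [0.0] * n_tokens
    for _ in range(n_permutations):
        order = list(range(n_tokens))
        rng.shuffle(order)
        visible: set = set()
        prev = value_fn(frozenset(visible))  # score with everything occluded
        for j in order:
            visible.add(j)  # un-occlude token j
            cur = value_fn(frozenset(visible))
            phi[j] += cur - prev  # marginal contribution of j in this ordering
            prev = cur
    return [p / n_permutations for p in phi]


def modality_shares(phi: List[float], n_text: int) -> Tuple[float, float]:
    """Aggregate token scores into (T-SHAP, V-SHAP).

    Assumes the first n_text entries are text tokens and the rest are image
    patches. Absolute values are used so that negative contributions still
    count as "the model used this modality".
    """
    text = sum(abs(p) for p in phi[:n_text])
    image = sum(abs(p) for p in phi[n_text:])
    total = text + image
    if total == 0:
        return 0.0, 0.0  # convention chosen for this sketch only
    return text / total, image / total


if __name__ == "__main__":
    # Toy stand-in for a model: an additive score over 2 text tokens and
    # 2 image patches. The second text token hurts the score (negative).
    weights = [0.5, -0.2, 0.3, 0.0]

    def value_fn(visible: FrozenSet[int]) -> float:
        return sum(weights[i] for i in visible)

    phi = sample_shapley(value_fn, n_tokens=4, n_permutations=50)
    t_shap, v_shap = modality_shares(phi, n_text=2)
    print(f"T-SHAP={t_shap:.2f} V-SHAP={v_shap:.2f}")  # T-SHAP=0.70 V-SHAP=0.30
```

With a real VLM, `value_fn` would run a forward pass on the masked image-text pair, so the cost is roughly `n_permutations * n_tokens` forward passes. That cost is why the subsets have to be subsampled rather than enumerated.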

cheferCAM cannot show negative contributions!