TL;DR
- I read this because… : related to my personal research
- task : Measure whether VLMs rely more on the vision or the language modality
- Problem : Traditional occlusion + accuracy-based methodologies do not accurately measure which modality was emphasized.
- IDEA: Score the model not on its accuracy, but on how much each modality influenced the model’s predictions.
- input/output : {image, text} -> score(positive, negative, neutral) for each modality
- architecture : ALBEF, CLIP, LXMERT, 4 VQA models
- baseline : task accuracy
- data : VQA, GQA, Image-sentence alignment (VQA, GQA), VALSE, FOIL
- evaluation : T-SHAP, V-SHAP
- result : -
- contribution :
- etc. :
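A minimal sketch of how the T-SHAP / V-SHAP modality scores can be computed, assuming we already have per-token Shapley values for the text and image tokens (the function and variable names here are illustrative, not from the paper's code): each modality's score is its share of the total absolute contribution.

```python
# Hedged sketch: given per-token Shapley values for a multimodal model,
# a modality's score is its share of the total absolute contribution.
# `phi_text` / `phi_image` are assumed inputs (per-token Shapley values).

def modality_shares(phi_text, phi_image):
    """Return (T-SHAP, V-SHAP): each modality's share of total |contribution|."""
    t = sum(abs(p) for p in phi_text)
    v = sum(abs(p) for p in phi_image)
    total = t + v
    return t / total, v / total

t_shap, v_shap = modality_shares([0.4, -0.1], [0.3, 0.2])
# t_shap + v_shap == 1.0 by construction
```

By construction the two scores sum to 1, so a model balanced across modalities would sit near T-SHAP = V-SHAP = 0.5.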
Details
motivation
CLIP cannot assign a negative score to an incorrect word (e.g., “keyboard”) — occlusion/accuracy methods only see whether removing a token hurts performance, not whether the token actively misled the model.
SHAP
It is based on the Shapley value from game theory.
Similar to occlusion-based methods, but instead of occluding each token individually, it occludes subsets (coalitions) of tokens. There are too many combinations to enumerate, so subsampling is used.
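The subsampling idea above can be sketched with Monte Carlo Shapley estimation over random token orderings. This is a toy illustration, not the paper's implementation: `model_score` stands in for a real VLM's alignment/answer score, and the informative-token set is made up.

```python
import random

# Monte Carlo Shapley sketch: sample random orderings of tokens,
# add tokens one at a time, and average each token's marginal
# contribution to the (toy) model score.

def model_score(active_tokens):
    # Toy stand-in for a VLM score: counts "informative" tokens present.
    informative = {"cat", "keyboard"}
    return sum(1.0 for t in active_tokens if t in informative)

def shapley_estimates(tokens, n_samples=500, seed=0):
    rng = random.Random(seed)
    phi = [0.0] * len(tokens)
    for _ in range(n_samples):
        order = list(range(len(tokens)))
        rng.shuffle(order)
        active = []
        prev = model_score(active)
        for i in order:
            active.append(tokens[i])
            cur = model_score(active)
            phi[i] += cur - prev  # marginal contribution of token i
            prev = cur
    return [p / n_samples for p in phi]

print(shapley_estimates(["a", "cat", "on", "keyboard"]))
```

Because the toy score is additive, the estimates converge exactly to 1.0 for the informative tokens and 0.0 for the rest; with a real model, the marginal contributions vary by coalition and can be negative.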
why not attention based?
CheferCAM produces only non-negative relevance scores, so it cannot capture negative (misleading) contributions!