TL;DR
- I read this because… : related to my personal research
- task : Measure whether VLMs rely more on the vision or the language modality
- Problem : Traditional occlusion + accuracy-based methodologies do not accurately measure which modality was emphasized.
- IDEA: Score the model not on its accuracy, but on how much each modality influenced the model’s predictions.
- input/output : {image, text} -> score(positive, negative, neutral) for each modality
- architecture : ALBEF, CLIP, LXMERT, 4 VQA models
- baseline : task accuracy
- data : VQA, GQA, Image-sentence alignment (VQA, GQA), VALSE, FOIL
- evaluation : T-SHAP, V-SHAP
- result : -
- contribution :
- etc. :
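A minimal sketch of how the T-SHAP / V-SHAP modality scores can be computed, assuming we already have per-token Shapley values for the text and image tokens (the function and variable names here are illustrative, not from the paper's code): each modality's score is its share of the total absolute contribution.

```python
# Hedged sketch: given per-token Shapley values for a multimodal model,
# a modality's score is its share of the total absolute contribution.
# `phi_text` / `phi_image` are assumed inputs (per-token Shapley values).

def modality_shares(phi_text, phi_image):
    """Return (T-SHAP, V-SHAP): each modality's share of total |contribution|."""
    t = sum(abs(p) for p in phi_text)
    v = sum(abs(p) for p in phi_image)
    total = t + v
    return t / total, v / total

t_shap, v_shap = modality_shares([0.4, -0.1], [0.3, 0.2])
# t_shap + v_shap == 1.0 by construction
```

By construction the two scores sum to 1, so a model balanced across modalities would sit near T-SHAP = V-SHAP = 0.5.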
Details
motivation
CLIP cannot assign a negative score to an incorrect word (e.g., “keyboard”) — occlusion/accuracy methods only see whether removing a token hurts performance, not whether the token actively misled the model.
SHAP
It is based on the Shapley value from game theory.
Similar to occlusion-based methods, but instead of occluding each token individually, it occludes subsets (coalitions) of tokens. There are too many combinations to enumerate, so subsampling is used.
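The subsampling idea above can be sketched with Monte Carlo Shapley estimation over random token orderings. This is a toy illustration, not the paper's implementation: `model_score` stands in for a real VLM's alignment/answer score, and the informative-token set is made up.

```python
import random

# Monte Carlo Shapley sketch: sample random orderings of tokens,
# add tokens one at a time, and average each token's marginal
# contribution to the (toy) model score.

def model_score(active_tokens):
    # Toy stand-in for a VLM score: counts "informative" tokens present.
    informative = {"cat", "keyboard"}
    return sum(1.0 for t in active_tokens if t in informative)

def shapley_estimates(tokens, n_samples=500, seed=0):
    rng = random.Random(seed)
    phi = [0.0] * len(tokens)
    for _ in range(n_samples):
        order = list(range(len(tokens)))
        rng.shuffle(order)
        active = []
        prev = model_score(active)
        for i in order:
            active.append(tokens[i])
            cur = model_score(active)
            phi[i] += cur - prev  # marginal contribution of token i
            prev = cur
    return [p / n_samples for p in phi]

print(shapley_estimates(["a", "cat", "on", "keyboard"]))
```

Because the toy score is additive, the estimates converge exactly to 1.0 for the informative tokens and 0.0 for the rest; with a real model, the marginal contributions vary by coalition and can be negative.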
why not attention based?
CheferCAM produces only non-negative relevance scores, so it cannot capture negative (misleading) contributions!