image

paper, code

TL;DR

  • I read this because… : related to my personal research
  • task : measure whether VLM models focus more on the vision or the language modality
  • Problem : traditional occlusion- and accuracy-based methodologies do not accurately measure which modality was emphasized.
  • IDEA : score your model not on its accuracy, but on how much each modality influenced the model’s predictions.
  • input/output : {image, text} -> score(positive, negative, neutral) for each modality
  • architecture : ALBEF, CLIP, LXMERT, 4 VQA models
  • baseline : task accuracy
  • data : VQA, GQA (VQA task); VALSE, FOIL (image-sentence alignment)
  • evaluation : T-SHAP, V-SHAP
  • result : -
  • contribution :
  • etc. :
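The T-SHAP / V-SHAP scores above can be sketched as each modality's share of the total absolute Shapley contribution. This is a minimal sketch of the metric as I understand it; the function name and signature are mine, not the paper's:

```python
def mm_shap(text_phi, image_phi):
    """Given per-token Shapley values for text tokens (text_phi) and
    image patches (image_phi), return (T-SHAP, V-SHAP): each modality's
    fraction of the total absolute contribution. The two scores sum to 1,
    so a T-SHAP near 1 means the model leans almost entirely on text."""
    t = sum(abs(p) for p in text_phi)
    v = sum(abs(p) for p in image_phi)
    total = t + v
    return t / total, v / total
```

For example, a sample where text tokens contribute `[1.0, -1.0]` and image patches `[2.0]` gives T-SHAP = V-SHAP = 0.5, because the metric uses absolute values: pushing the prediction down counts as influence too.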

Details

motivation

image

CLIP cannot assign negative scores to incorrect words (e.g., “keyboard”).

SHAP

SHAP is based on the Shapley value from cooperative game theory.

image
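As a reminder, the standard Shapley value (textbook definition, not quoted from the paper) of player $i$ in a game with value function $v$ over player set $N$ is:

```latex
\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}}
  \frac{|S|!\,(|N|-|S|-1)!}{|N|!}
  \bigl( v(S \cup \{i\}) - v(S) \bigr)
```

In this setting, as I read the paper, the players are the text tokens and image patches, and $v(S)$ is the model’s prediction when only the tokens in $S$ are left unoccluded.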

image

Similar to occlusion-based methods, but instead of occluding each token individually, it occludes subsets (coalitions) of tokens and measures each token’s marginal contribution. Since there are exponentially many ($2^n$) subsets, Monte Carlo subsampling is used.

why not attention-based?

image

CheferCAM does not produce negative relevance scores!

image