image

paper

TL;DR

  • I read this because... : related to my own research; recommended by Claude AI
  • task : contrastive explanation. Explain why you chose B over A
  • problem : I want the model to be explainable, but I can’t enumerate every possible explanation; it’s simpler to explain why the model chose y over a specific alternative y'.
  • idea : Subtract the two rows of the final weight matrix W corresponding to the predicted class y and the alternative y', and use the difference to build a projection applied to the hidden state. Then forward the model several times with text spans masked, compare the resulting values, and highlight the spans with the largest change.
  • input/output : text → class // text spans highlighted to show why the model predicts class y over y'
  • architecture : RoBERTa
  • objective : MLM
  • baseline : -
  • data : NLI, BIOS (classifying occupations from biographies)
  • evaluation : I didn’t fully understand it.
  • result : Also unclear to me. It seems to be evaluated mostly qualitatively?
  • contribution : Seems to be almost pioneering work in the area of contrastive explanation.
  • etc. :

Details

image image

method

With a text span masked, we run multiple forward passes of the model and compare the outputs. The projection-based erasure is called the amnesic methodology.
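The masked re-forwarding loop could look like the sketch below. This is my own hedged reconstruction, not the paper's code: `encode` is a hypothetical stand-in for the RoBERTa encoder (token list → hidden vector), and scoring each token by the drop in the contrastive logit $u^\top h$ is one plausible way to compare the forwards.

```python
import numpy as np

def contrastive_scores(tokens, encode, W, y_star, y_prime, mask_token="<mask>"):
    """Score each token by how much masking it changes the contrastive logit.

    `encode` is a hypothetical stand-in for the encoder: it maps a token
    list to a hidden vector h. W is the final linear layer (K x d).
    """
    u = W[y_star] - W[y_prime]            # contrastive direction
    base = u @ encode(tokens)             # u^T h_x with no masking
    scores = []
    for i in range(len(tokens)):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        # drop in the contrastive logit when token i is hidden
        scores.append(base - u @ encode(masked))
    return scores
```

Spans whose masking causes the largest drop are the ones highlighted as the contrastive explanation.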

  • $K$ : number of output classes
  • y : output class
  • enc : neural encoder
  • $W \in \mathbb{R}^{K \times d}$ : final linear layer
  • $y^*$ : model prediction (fact) / $y'$ : alternative prediction
  • $p$ : model probabilities
  • $w_{y^*}$, $w_{y'}$ : the rows of the weight matrix $W$ used to predict the two classes

Combine the two weight rows $w_{y^*}$ and $w_{y'}$ into a single contrastive direction $u$:

$$u = w_{y^*} - w_{y'}$$

If the model scores $y^*$ higher than $y'$, then $u^\top h_x > 0$.
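A minimal numpy sketch of the contrastive direction (shapes and variable names are my own, not from the paper). Since $u^\top h_x = w_{y^*}^\top h_x - w_{y'}^\top h_x$ is exactly the logit gap, its sign tells us which class the model prefers:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 3, 8                      # number of classes, hidden size
W = rng.normal(size=(K, d))      # final linear layer
h_x = rng.normal(size=d)         # encoder hidden state for input x

logits = W @ h_x
y_star = int(np.argmax(logits))  # fact: the model's prediction
y_prime = (y_star + 1) % K       # some alternative class (the foil)

u = W[y_star] - W[y_prime]       # contrastive direction
# The model prefers y* over y' exactly when u^T h_x > 0:
assert (u @ h_x > 0) == (logits[y_star] > logits[y_prime])
```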

image

Use this $u$ to build a projection for the hidden state $h_x$. The resulting operation $C$ is a matrix that can be interpreted as a contrastive intervention on $h_x$. Then we repeat the earlier step on the projected state, $q = \text{softmax}(W\,C(h_x))$, and obtain the coefficient of each text span as shown below.

image

where $p$ is the model's probability distribution without the projection and $q$ is the distribution with the projection applied.
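Putting the two distributions side by side, under my assumptions: I'm not certain whether the paper's $C$ keeps the $u$-component or removes it (the amnesic variant would use $I - uu^\top/u^\top u$ instead); the sketch below projects onto $\mathrm{span}(u)$, keeping only the information that separates $y^*$ from $y'$. All names are mine.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
K, d = 3, 8
W = rng.normal(size=(K, d))
h_x = rng.normal(size=d)

y_star = int(np.argmax(W @ h_x))
y_prime = (y_star + 1) % K
u = W[y_star] - W[y_prime]

# Projection matrix C built from u (assumption: projection onto span(u));
# C @ h_x keeps only the component of h_x along the contrastive direction.
# The amnesic/removal variant would be C = np.eye(d) - np.outer(u, u) / (u @ u).
C = np.outer(u, u) / (u @ u)

p = softmax(W @ h_x)         # probabilities without the projection
q = softmax(W @ (C @ h_x))   # probabilities with the projection

# Comparing p and q per masked span is what highlights the decisive spans.
delta = p[y_star] - q[y_star]
```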

result

image

Results are hard to interpret…