image

paper

TL;DR

  • I read this because... : related to my own research; recommended by Claude AI
  • task : contrastive explanation. Explain why you chose B over A
  • problem : I want the model to be explainable, but I can’t enumerate every possible explanation; it’s simpler to explain why the model chose y over a specific alternative y'.
  • idea : Subtract the two rows of the final weight matrix W corresponding to the predicted class y and the alternative y', and use the difference to build a projection applied to the hidden state. Then forward the model several times with text spans masked, compare the resulting values, and highlight the spans with the largest change.
  • input/output : text → class // text spans highlighted to show why the model predicts class y over y'
  • architecture : RoBERTa
  • objective : MLM
  • baseline : -
  • data : NLI, BIOS (classifying occupations from biographies)
  • evaluation : I didn’t fully understand it.
  • result : Also unclear to me. It seems to be evaluated mostly qualitatively?
  • contribution : Seems to be almost pioneering work in the area of contrastive explanation.
  • etc. :

Details

image image

method

With a text span masked, we run multiple forward passes of the model and compare the outputs. The projection-based erasure is called the amnesic methodology.
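The masked re-forwarding loop could look like the sketch below. This is my own hedged reconstruction, not the paper's code: `encode` is a hypothetical stand-in for the RoBERTa encoder (token list → hidden vector), and scoring each token by the drop in the contrastive logit $u^\top h$ is one plausible way to compare the forwards.

```python
import numpy as np

def contrastive_scores(tokens, encode, W, y_star, y_prime, mask_token="<mask>"):
    """Score each token by how much masking it changes the contrastive logit.

    `encode` is a hypothetical stand-in for the encoder: it maps a token
    list to a hidden vector h. W is the final linear layer (K x d).
    """
    u = W[y_star] - W[y_prime]            # contrastive direction
    base = u @ encode(tokens)             # u^T h_x with no masking
    scores = []
    for i in range(len(tokens)):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        # drop in the contrastive logit when token i is hidden
        scores.append(base - u @ encode(masked))
    return scores
```

Spans whose masking causes the largest drop are the ones highlighted as the contrastive explanation.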

  • $K$ : number of output classes
  • y : output class
  • enc : neural encoder
  • $W \in \mathbb{R}^{K \times d}$ : final linear layer
  • $y^*$ : model prediction (fact) / $y'$ : alternative prediction
  • $p$ : model probabilities
  • $w_{y^*}$, $w_{y'}$ : the rows of the weight matrix $W$ used to predict the two classes

Combine the two weight rows $w_{y^*}$ and $w_{y'}$ into a single contrastive direction $u$:

$$u = w_{y^*} - w_{y'}$$

If the model scores $y^*$ higher than $y'$, then $u^\top h_x > 0$.
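A minimal numpy sketch of the contrastive direction (shapes and variable names are my own, not from the paper). Since $u^\top h_x = w_{y^*}^\top h_x - w_{y'}^\top h_x$ is exactly the logit gap, its sign tells us which class the model prefers:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 3, 8                      # number of classes, hidden size
W = rng.normal(size=(K, d))      # final linear layer
h_x = rng.normal(size=d)         # encoder hidden state for input x

logits = W @ h_x
y_star = int(np.argmax(logits))  # fact: the model's prediction
y_prime = (y_star + 1) % K       # some alternative class (the foil)

u = W[y_star] - W[y_prime]       # contrastive direction
# The model prefers y* over y' exactly when u^T h_x > 0:
assert (u @ h_x > 0) == (logits[y_star] > logits[y_prime])
```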

image

Use this $u$ to build a projection for the hidden state $h_x$. The resulting operation $C$ is a matrix that can be interpreted as a contrastive intervention on $h_x$. Then we repeat the earlier step on the projected state, $q = \text{softmax}(W\,C(h_x))$, and obtain the coefficient of each text span as shown below.

image

where $p$ is the model's probability distribution without the projection and $q$ is the distribution with the projection applied.
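Putting the two distributions side by side, under my assumptions: I'm not certain whether the paper's $C$ keeps the $u$-component or removes it (the amnesic variant would use $I - uu^\top/u^\top u$ instead); the sketch below projects onto $\mathrm{span}(u)$, keeping only the information that separates $y^*$ from $y'$. All names are mine.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
K, d = 3, 8
W = rng.normal(size=(K, d))
h_x = rng.normal(size=d)

y_star = int(np.argmax(W @ h_x))
y_prime = (y_star + 1) % K
u = W[y_star] - W[y_prime]

# Projection matrix C built from u (assumption: projection onto span(u));
# C @ h_x keeps only the component of h_x along the contrastive direction.
# The amnesic/removal variant would be C = np.eye(d) - np.outer(u, u) / (u @ u).
C = np.outer(u, u) / (u @ u)

p = softmax(W @ h_x)         # probabilities without the projection
q = softmax(W @ (C @ h_x))   # probabilities with the projection

# Comparing p and q per masked span is what highlights the decisive spans.
delta = p[y_star] - q[y_star]
```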

result

image

Results are hard to interpret…