TL;DR
- I read this because.. : a.k.a. TiBA; interested in explainable CLIP scores. Read as a preliminary.
- task : interpretability of neural networks
- problem : Applying the existing Layer-wise Relevance Propagation (LRP) method to a transformer fails because (1) transformers use skip connections and (2) the activation is not ReLU (e.g., GELU), so negative values appear.
- Idea : (1) modify the rule to account for both positive and negative values, (2) add a normalization term, and (3) combine the attention maps with relevance scores and gradients to get the final score.
- input/output : image -> class // heatmap in image
- architecture : ViT-B, BERT
- baseline : rollout, raw attention, GradCAM, LRP, partial LRP
- data : ImageNet 2012, ImageNet-Segmentation, Movie Reviews
- evaluation : AUC (perturbation tests), pixel accuracy / mAP / mIoU (segmentation), token-F1 (Movie Reviews)
- result : consistently outperforms the prior explainability baselines
- contribution : explainability for transformers; the closest prior approach, attention flow, is computationally slow
- etc. : While reading, I recognized the author as the person who gave the explainability tutorial at CVPR a while back! Also, while looking up the paper I found a Ms. Kim working on the XAI side; she is a Korean woman, and I was glad to see her.
Details
Method
The goal is to get the LRP-based relevance for each attention head in the transformer and combine it with the gradient for class-specific visualization.
Relevance and gradients
The gradient of $y_{t}$ (the output of the model for class t) with respect to $x_j^{(n)}$, element j of the input x of the n-th layer, follows from the chain rule:

$$\nabla x_j^{(n)} := \frac{\partial y_t}{\partial x_j^{(n)}} = \sum_i \frac{\partial y_t}{\partial x_i^{(n-1)}} \cdot \frac{\partial x_i^{(n-1)}}{\partial x_j^{(n)}}$$
If we define $L^{(n)}(X, Y)$ as the operation of layer n on two tensors X, Y (here the feature map and the weights), applying the Deep Taylor Decomposition gives the relevance:

$$R_j^{(n)} = \mathcal{G}\left(X, Y, R^{(n-1)}\right) = \sum_i \frac{X_j \frac{\partial L_i^{(n)}(X, Y)}{\partial X_j}}{\sum_{j'} X_{j'} \frac{\partial L_i^{(n)}(X, Y)}{\partial X_{j'}}} R_i^{(n-1)}$$
deep taylor decomposition uses taylor approximation to find relevance http://arxiv.org/abs/1512.02479
It locally approximates the output with a first-order (gradient) term ~ I only understood this up to a point.
The conservation rule dictates that the sum of the relevances in the n-th layer and the sum of the relevances in the (n-1)-th layer must be equal: $\sum_j R_j^{(n)} = \sum_i R_i^{(n-1)}$.
This also comes from the paper above, and means that f(x) equals the total relevance.
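The generic rule and its conservation property can be sketched with NumPy for a single linear layer (the function name and the eps stabilizer are my own, not from the paper's code):

```python
import numpy as np

def lrp_linear(x, w, relevance_out, eps=1e-9):
    """Generic Deep-Taylor/LRP rule for a linear layer z = x @ w:
    relevance flowing into output i is redistributed to input j in
    proportion to its contribution x_j * w[j, i]."""
    z = x @ w                                   # pre-activations, shape [out]
    contrib = x[:, None] * w                    # x_j * w[j, i], shape [in, out]
    # sign-matched eps stabilizes near-zero denominators
    denom = z + eps * np.where(z >= 0, 1.0, -1.0)
    return (contrib / denom[None, :]) @ relevance_out  # shape [in]

rng = np.random.default_rng(0)
x = rng.normal(size=4)
w = rng.normal(size=(4, 3))
R_out = np.abs(rng.normal(size=3))
R_in = lrp_linear(x, w, R_out)
```

Each column of `contrib / denom` sums to (almost exactly) 1, so the total relevance is conserved: `R_in.sum()` matches `R_out.sum()`.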
The original LRP paper assumes ReLU activations, so only positive values appear
- $v^+$ : max(0, v)
But with an activation like GELU, negative values can appear.
So they changed it to propagate relevance only through the subset of index pairs with a positive contribution (…? I don't know what difference this makes)
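As far as I understand it, the modification keeps only the (input, weight) index pairs whose contribution is positive; a minimal sketch under that assumption (my own naming, single linear layer):

```python
import numpy as np

def lrp_linear_positive(x, w, relevance_out, eps=1e-9):
    """Modified rule: relevance is propagated only through the subset of
    (j, i) index pairs with a positive contribution x_j * w[j, i],
    instead of relying on ReLU to make every contribution positive."""
    contrib = np.maximum(x[:, None] * w, 0.0)   # keep positive pairs only
    denom = contrib.sum(axis=0, keepdims=True) + eps
    return (contrib / denom) @ relevance_out
```

One visible difference from the generic rule: with non-negative incoming relevance the result stays non-negative even when the inputs (e.g., after GELU) are negative.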
And the very first relevance is initialized as a one-hot vector for the target class t.
non-parametric relevance propagation
The transformer has two operations that mix two feature-map tensors (unlike before, where a feature map was mixed with a weight tensor): (1) skip connections and (2) matrix multiplication.
Given two tensors u, v, the two operators are defined as Add(u, v) = u + v (skip connection) and Mul(u, v) = u·v (matrix multiplication), and relevance is propagated to each operand with the generic rule above.
$R^u$ is the relevance attributed to u and $R^v$ the relevance attributed to v. On skip connections the total relevance tended to blow up, so a normalization is added that rescales $R^u$ and $R^v$ so their sum again matches the incoming relevance.
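A simplified sketch of the normalization idea (not the paper's exact formula; assumes non-negative branch relevances):

```python
import numpy as np

def normalize_skip_relevance(R_u, R_v, incoming_total):
    """Naively propagating relevance through Add(u, v) can inflate the
    total; rescale both branches by a common factor so they again sum
    to the incoming relevance, restoring the conservation rule."""
    total = R_u.sum() + R_v.sum()
    scale = incoming_total / total
    return R_u * scale, R_v * scale
```

After rescaling, `R_u.sum() + R_v.sum()` equals `incoming_total`, so conservation holds across the skip connection.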
Relevance and gradient diffusion
Applying the above procedure to the self-attention operation and combining the relevance with the gradient gives, per block,

$$\bar{A}^{(b)} = I + \mathbb{E}_h\left[\left(\nabla A^{(b)} \odot R^{(n_b)}\right)^{+}\right]$$

$A^{(b)}$ : attention map of the b-th block; $E_h$ : average over the heads dimension; only the positive part of $\nabla A^{(b)} \odot R^{(n_b)}$ is kept.
For the final map, as in rollout, the per-block maps are simply multiplied iteratively: $C = \bar{A}^{(1)} \cdot \bar{A}^{(2)} \cdots \bar{A}^{(B)}$.
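The two aggregation schemes side by side, as I understand them (NumPy sketch; the function names and the per-block arrays `attns`, `grads`, `rels` of shape [heads, s, s] are my own conventions):

```python
import numpy as np

def attention_rollout(attns):
    """Baseline rollout: multiply (I + head-averaged attention),
    row-normalized, across all blocks."""
    s = attns[0].shape[-1]
    joint = np.eye(s)
    for A in attns:                                   # A: [heads, s, s]
        A_bar = np.eye(s) + A.mean(axis=0)            # identity = skip connection
        A_bar /= A_bar.sum(axis=-1, keepdims=True)    # keep rows stochastic
        joint = joint @ A_bar
    return joint

def relevance_rollout(grads, rels):
    """The paper's variant: weight each attention map elementwise by the
    positive part of (gradient * relevance), average over heads, add the
    identity for the skip connection, then multiply across blocks."""
    s = grads[0].shape[-1]
    joint = np.eye(s)
    for G, R in zip(grads, rels):                     # G, R: [heads, s, s]
        A_bar = np.eye(s) + np.maximum(G * R, 0.0).mean(axis=0)
        joint = joint @ A_bar
    return joint
```

Note that in the weighted variant the raw attention map enters only through its gradient and relevance, which is what makes the result class-specific.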
Obtaining the image relevance map
The result is a matrix C of s x s.
Each row shows how that token relates to other tokens in the relevance map
Since this study only considers classification models, the relevance scores are taken from the row of C for the [CLS] token. For ViT, the [CLS] entry is dropped from the sequence of length s, the remaining s-1 patch scores are reshaped to $\sqrt{s-1} \times \sqrt{s-1}$, and the result is interpolated back up to the input resolution.
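That extraction step, sketched for ViT-B/16-style inputs (224x224 image, 196 patches plus [CLS]; `np.kron` does nearest-neighbour upsampling here where the paper interpolates, and the function name is my own):

```python
import numpy as np

def cls_relevance_to_heatmap(C, image_size=224):
    """Take the [CLS] row of the s x s relevance matrix C (assuming
    [CLS] is token 0), drop its own entry, reshape the s-1 patch scores
    to a sqrt(s-1) x sqrt(s-1) grid, and upsample to the image size."""
    cls_row = C[0, 1:]                          # relevance of each patch token
    side = int(round(np.sqrt(cls_row.size)))    # 14 for ViT-B/16
    grid = cls_row.reshape(side, side)
    scale = image_size // side                  # 16-pixel patches
    return np.kron(grid, np.ones((scale, scale)))  # nearest-neighbour upsample
```

For s = 197 this yields a 14x14 grid blown up to a 224x224 heatmap.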
Result
qualitative
perturbation
Mask out the pixels the method says are important and see how top-1 accuracy changes; in the positive test the most important pixels are erased first, so a faithful explanation makes accuracy drop quickly (lower AUC is better).
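The reported AUC can be sketched as the area under the accuracy-vs-masked-fraction curve (trapezoid rule; the helper name is my own):

```python
import numpy as np

def perturbation_auc(fractions, accuracies):
    """Area under the top-1-accuracy curve as progressively larger
    fractions of 'important' pixels are masked. For the positive test a
    faithful explanation makes accuracy fall fast, so lower is better."""
    f = np.asarray(fractions, dtype=float)
    a = np.asarray(accuracies, dtype=float)
    # trapezoid rule over the (fraction, accuracy) curve
    return float(np.sum((f[1:] - f[:-1]) * (a[1:] + a[:-1]) / 2.0))
```

For example, masking 0% / 50% / 100% of pixels with accuracies 0.8 / 0.2 / 0.0 gives an AUC of 0.3.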
segmentation
token-f1