
paper, code

TL;DR

  • I read this because.. : a.k.a. TiBA; I'm interested in explainable CLIP scores, so I read this as a preliminary
  • task : interpretability of neural networks
  • problem : applying the existing Layer-wise Relevance Propagation (LRP) method to a Transformer fails because (1) it has skip connections and (2) it does not use ReLU activations, so relevance values can turn negative
  • idea : (1) change the rule so it handles both positive and negative values, (2) add a normalization term, and (3) combine the attention maps with the relevance scores to get the final score
  • input/output : image -> class // heatmap over the image
  • architecture : ViT-B, BERT
  • baseline : rollout, raw attention, GradCAM, LRP, partial LRP
  • data : ImageNet 2012, ImageNet-Segmentation, Movie Reviews
  • evaluation : AUC (perturbation tests), pixel accuracy / mAP / mIoU (segmentation), token-F1 (Movie Reviews)
  • result : outperforms the prior baselines listed above
  • contribution : class-specific explainability for Transformers; the existing attention-flow approach is too slow
  • etc. : while reading I recognized the author who gave the explainability tutorial at CVPR in the past! Also, while looking up the paper I found Ms. Kim on the XAI side, and I was glad to see a Korean woman in the field.

Details

Method


The goal is to compute an LRP-based relevance for each attention head in the Transformer and combine it with the gradient to get a class-specific visualization.

Relevance and gradients

By the chain rule, the gradient of $y_t$ (the output of the model for class t) with respect to $x_j^{(n)}$, element j of the input x of the n-th layer, is

$$\nabla x_j^{(n)} := \frac{\partial y_t}{\partial x_j^{(n)}} = \sum_i \frac{\partial y_t}{\partial x_i^{(n-1)}} \cdot \frac{\partial x_i^{(n-1)}}{\partial x_j^{(n)}}$$

where layers are indexed backward from the output, so $x^{(n-1)}$ is the input of the layer just downstream of layer n.
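As a quick sanity check of this chain-rule decomposition, here is a toy numeric verification on a hypothetical two-layer linear network (not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer linear network: h = W1 @ x, y = W2 @ h; y_t is one logit.
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)
t = 0  # target class

# Chain rule: dy_t/dx_j = sum_i (dy_t/dh_i) * (dh_i/dx_j)
dy_dh = W2[t]        # gradient of y_t w.r.t. the downstream feature map h
dh_dx = W1           # Jacobian of h w.r.t. x
grad = dy_dh @ dh_dx

# Central finite differences should agree with the chain-rule gradient.
eps = 1e-6
fd = np.array([((W2 @ (W1 @ (x + eps * e)))[t]
                - (W2 @ (W1 @ (x - eps * e)))[t]) / (2 * eps)
               for e in np.eye(3)])
print(np.allclose(grad, fd, atol=1e-5))  # True
```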

If we define $L^{(n)}(X, Y)$ as a layer operation on two tensors X and Y (typically a feature map and a weight matrix) and apply Deep Taylor Decomposition, the relevance propagation rule becomes

$$R_j^{(n)} = \mathcal{G}(X, Y, R^{(n-1)}) = \sum_i X_j \frac{\partial L_i^{(n)}(X, Y)}{\partial X_j} \frac{R_i^{(n-1)}}{L_i^{(n)}(X, Y)}$$

Deep Taylor Decomposition uses a Taylor approximation to derive relevance: http://arxiv.org/abs/1512.02479

Roughly, the layer output is approximated with a first-order (gradient) Taylor expansion ~ I only followed this up to a point.

The conservation rule dictates that the total relevance of the n-th layer and the total relevance of the (n−1)-th layer must be equal:

$$\sum_j R_j^{(n)} = \sum_i R_i^{(n-1)}$$

This also comes from the paper above, and means that the model output equals the sum of relevance: $f(x) = \sum_j R_j^{(n)}$.

The original LRP paper assumes ReLU activations, so all propagated values are non-negative and only positive weighted contributions are kept:

$$R_j^{(n)} = \sum_i \frac{x_j w_{ji}^{+}}{\sum_{j'} x_{j'} w_{j'i}^{+}} R_i^{(n-1)}$$

  • $v^+$ : max(0, v)

But with an activation like GELU, inputs can be negative, so this assumption breaks. The fix is to sum only over the subset of (input, weight) pairs whose product is positive, $q = \{(i,j) \mid x_j w_{ji} \geq 0\}$:

$$R_j^{(n)} = \mathcal{G}_q(x, w, q, R^{(n-1)}) = \sum_{\{i \mid (i,j) \in q\}} \frac{x_j w_{ji}}{\sum_{\{j' \mid (i,j') \in q\}} x_{j'} w_{j'i}} R_i^{(n-1)}$$

(…? I don't know what practical difference this makes.)

The very first relevance, $R^{(0)}$, is initialized to a one-hot vector for the target class t.
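A minimal numpy sketch of this positive-subset rule for a single linear layer; the function and variable names are mine, not the paper's:

```python
import numpy as np

def lrp_pos_subset(x, W, R_prev, eps=1e-9):
    """Propagate relevance through y = W @ x, keeping only the
    (input, weight) pairs whose contribution x_j * W_ij is positive."""
    z = W * x[None, :]            # z[i, j] = x_j * W_ij, per-pair contribution
    z = np.where(z > 0, z, 0.0)   # restrict to the positive subset q
    denom = z.sum(axis=1, keepdims=True) + eps
    # R_j = sum_i (z_ij / denom_i) * R_i
    return ((z / denom) * R_prev[:, None]).sum(axis=0)

rng = np.random.default_rng(1)
x = rng.normal(size=5)              # may be negative, e.g. after GELU
W = rng.normal(size=(3, 5))
R_prev = np.array([1.0, 0.0, 0.0])  # one-hot relevance at the output, class t = 0

R = lrp_pos_subset(x, W, R_prev)
print(R.sum())  # ~1.0: the conservation rule holds
```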

Non-parametric relevance propagation

The Transformer has two operations that mix two feature-map tensors (unlike the parametric case above, where one operand was a feature map and the other a weight): (1) skip connection and (2) matrix multiplication.

Given two tensors u and v, the two operators are

$$\text{add}(u, v) = u + v, \qquad \text{matmul}(u, v) = u \cdot v$$

and each operand gets its own relevance via the same propagation rule: $R^{u} = \mathcal{G}(u, v, R^{(n-1)})$ for the leading u and $R^{v} = \mathcal{G}(v, u, R^{(n-1)})$ for the trailing v. On skip connections these relevance scores tended to be too large, so a normalization is added that rescales them to restore conservation:

$$\bar{R}_j^{u} = R_j^{u} \cdot \frac{\left|\sum_{j'} R_{j'}^{u}\right|}{\left|\sum_{j'} R_{j'}^{u}\right| + \left|\sum_{k} R_k^{v}\right|} \cdot \frac{\sum_i R_i^{(n-1)}}{\sum_{j'} R_{j'}^{u}}$$

and symmetrically for $\bar{R}_k^{v}$, so that $\sum_j \bar{R}_j^{u} + \sum_k \bar{R}_k^{v} = \sum_i R_i^{(n-1)}$.
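A sketch of that normalization for a skip connection, assuming the relevance of each operand has already been computed; names like `normalize_skip` are my own:

```python
import numpy as np

def normalize_skip(R_u, R_v, R_in):
    """Rescale the two relevance maps of add(u, v) so that their total
    equals the incoming relevance R_in (restores conservation)."""
    su, sv = R_u.sum(), R_v.sum()
    w_u = abs(su) / (abs(su) + abs(sv))   # share of relevance assigned to u
    w_v = abs(sv) / (abs(su) + abs(sv))   # share assigned to v
    total = R_in.sum()
    return R_u * w_u * total / su, R_v * w_v * total / sv

R_in = np.array([0.5, 0.5])
R_u = np.array([2.0, 1.0])   # too large: skip connections inflate relevance
R_v = np.array([0.5, 0.5])
R_u_bar, R_v_bar = normalize_skip(R_u, R_v, R_in)
print(R_u_bar.sum() + R_v_bar.sum())  # 1.0 = R_in.sum()
```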

Relevance and gradient diffusion


Applying the above procedure to the self-attention operation yields

$$\bar{A}^{(b)} = I + \mathbb{E}_h\left[\left(\nabla A^{(b)} \odot R^{(n_b)}\right)^{+}\right], \qquad C = \bar{A}^{(1)} \cdot \bar{A}^{(2)} \cdots \bar{A}^{(B)}$$

$A^{(b)}$ : attention map of the b-th block; $R^{(n_b)}$ : its relevance; $\mathbb{E}_h$ : average over the heads dimension. Only the positive part of the gradient-relevance product is kept, and the identity accounts for the skip connection.

For comparison, rollout simply multiplies the head-averaged attention maps across blocks:

$$\text{rollout} = \hat{A}^{(1)} \cdot \hat{A}^{(2)} \cdots \hat{A}^{(B)}, \qquad \hat{A}^{(b)} = I + \mathbb{E}_h\, A^{(b)}$$
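Both aggregations can be sketched in numpy over stacked per-block tensors of shape `[blocks, heads, s, s]` (shapes and names are my own; the paper's implementation may differ in details such as row normalization):

```python
import numpy as np

def diffusion_map(A, gradA, R):
    """C = prod_b (I + E_h[(grad A^(b) * R^(b))^+]); inputs are [B, h, s, s]."""
    s = A.shape[-1]
    C = np.eye(s)
    for b in range(A.shape[0]):
        Abar = np.eye(s) + np.maximum(gradA[b] * R[b], 0.0).mean(axis=0)
        C = C @ Abar
    return C

def rollout_map(A):
    """Plain rollout: multiply head-averaged attention maps plus identity."""
    s = A.shape[-1]
    C = np.eye(s)
    for b in range(A.shape[0]):
        C = C @ (np.eye(s) + A[b].mean(axis=0))
    return C

rng = np.random.default_rng(2)
B, h, s = 2, 3, 4
A = rng.uniform(size=(B, h, s, s))       # attention maps per block/head
gradA = rng.normal(size=(B, h, s, s))    # class-specific gradients dL/dA
R = rng.normal(size=(B, h, s, s))        # LRP relevance of each attention map
C = diffusion_map(A, gradA, R)
print(C.shape)  # (4, 4)
```

With zero gradients every $\bar{A}^{(b)}$ collapses to the identity, so C does too, which is a quick way to check the implementation.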

Obtaining the image relevance map

The result is an $s \times s$ matrix C; each row shows how strongly that token relates to every other token in the relevance map. Since this study only considers classification models, we take only the row of C for the [CLS] token. For ViT, the [CLS] entry is dropped from the length-s sequence, the remaining $s-1$ scores are reshaped to $\sqrt{s-1} \times \sqrt{s-1}$, and the map is interpolated back up to the image resolution.
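Those steps can be sketched as follows, using nearest-neighbor upsampling in place of the bilinear interpolation the paper uses (token 0 is assumed to be [CLS]):

```python
import numpy as np

def cls_row_to_heatmap(C, image_size):
    """Take the [CLS] row of the s x s matrix C, drop the [CLS] column,
    reshape the remaining s-1 patch scores to a square grid, and
    upsample the grid to image_size x image_size."""
    cls_row = C[0, 1:]                   # relevance of [CLS] to each patch
    side = int(round(np.sqrt(cls_row.size)))
    grid = cls_row.reshape(side, side)
    scale = image_size // side
    return np.kron(grid, np.ones((scale, scale)))  # nearest-neighbor upsample

s = 1 + 14 * 14                          # e.g. ViT-B/16 at 224x224: 196 patches + [CLS]
C = np.random.default_rng(3).uniform(size=(s, s))
heatmap = cls_row_to_heatmap(C, 224)
print(heatmap.shape)  # (224, 224)
```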

Result

  • qualitative

  • perturbation

This measures how top-1 accuracy changes as the pixels the method deemed important are masked out; in the positive test, the most important pixels are erased first, so lower is better.
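The positive perturbation test can be sketched like this; the accuracy function here is a stand-in for re-running the classifier on the masked images:

```python
import numpy as np

def positive_perturbation_auc(relevance, accuracy_at, steps=10):
    """Erase the most-relevant pixels first, in growing fractions, and
    return the area under the resulting accuracy curve (lower AUC means
    the explanation found the truly important pixels)."""
    order = np.argsort(relevance.ravel())[::-1]   # most relevant first
    fracs = np.linspace(0.0, 0.9, steps)
    accs = np.array([accuracy_at(order[: int(f * order.size)]) for f in fracs])
    # trapezoidal area under the accuracy-vs-masking-fraction curve
    return float(np.sum((accs[:-1] + accs[1:]) / 2 * np.diff(fracs)))

# Stand-in "model accuracy": decays with the total relevance removed.
rel = np.random.default_rng(4).uniform(size=(14, 14))
acc = lambda masked: 1.0 - rel.ravel()[masked].sum() / rel.sum()
auc = positive_perturbation_auc(rel, acc)
print(0.0 <= auc <= 1.0)  # True
```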

  • segmentation

  • token-F1