
paper, code

TL;DR

  • I read this because.. : aka CheferCAM. I'm interested in explainable CLIP scores; the paper's repo publishes a Colab where you can visualize the results token by token.
  • task : explainability for neural networks
  • problem : the previous TiBA (https://github.com/long8v/PTIR/issues/158) handles only self-attention; this work also covers co-attention and encoder-decoder structures in multi-modal settings.
  • idea : use the gradient with respect to the attention map instead of the gradient of the previous layer's output (as in LRP)
  • input/output : model // heatmap over text or vision tokens
  • architecture : ViT, VisualBERT, LXMERT, DETR
  • baseline : rollout, raw attention, Grad-CAM, Partial LRP, TiBA
  • evaluation : perturbation tests (on both image and text tokens for VisualBERT), weakly-supervised / semantic segmentation
  • result : Better performance than its predecessor
  • contribution : extends explainability to cross-attention and co-attention architectures. ICCV oral
  • etc. : The deep Taylor decomposition in the predecessor was tiring, but if I skip that and just read this paper, no theoretical machinery is needed and it feels clean… on the other hand, the absence of theory makes it feel a bit ad hoc. In the case of CLIP, the final output is an embedding, so this doesn't seem to be a visualization of the CLIPScore itself…? Need to look at the Colab closely.

Details

some notation

  • $i$ denotes an image token
  • $t$ denotes a text token
  • $A^{tt}$ is the self-attention between text tokens / $A^{ii}$ is the self-attention between image tokens
  • $A^{ti}$ is the multimodal (cross-modal) attention interaction

Relevancy initialization

We’re going to initialize the relevancy maps and then update them layer by layer.

Before any self-attention, the tokens have not interacted with each other, so $R^{ii}$ and $R^{tt}$ are identity matrices, while $R^{it}$ and $R^{ti}$ are zero tensors.
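As a minimal NumPy sketch (function name and shapes are my own, not from the paper), the initialization looks like:

```python
import numpy as np

def init_relevancy(n_text, n_img):
    """Relevancy-map initialization: self-maps start as identity
    (each token only 'explains' itself); cross-modal maps start at
    zero (no interaction has happened yet)."""
    R_tt = np.eye(n_text)                 # text queries, text keys
    R_ii = np.eye(n_img)                  # image queries, image keys
    R_ti = np.zeros((n_text, n_img))      # text queries, image keys
    R_it = np.zeros((n_img, n_text))      # image queries, text keys
    return R_tt, R_ii, R_ti, R_it
```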

Relevancy update rules

We’ll update the relevancy maps using the attention map $A$: as in the predecessor, we weight it by its gradient and average across heads.

$$\bar{A} = \mathbb{E}_h\left[(\nabla A \odot A)^{+}\right]$$

where $\nabla A$ is the gradient of $y_t$, the output for the class $t$ we want to visualize, with respect to $A$. Only the positive values are kept before averaging (clamp); there is no deep justification for this, it simply follows the predecessor.
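A sketch of this step in NumPy, assuming the attention map and its gradient have already been pulled out of the model (the function name and shapes are mine):

```python
import numpy as np

def gradient_weighted_attention(A, grad_A):
    """Gradient-weighted attention relevance: element-wise product of
    the attention map with its gradient, clamped to the positives,
    then averaged across heads.

    A, grad_A: arrays of shape (heads, queries, keys); grad_A is
    d y_t / d A for the target class t (via autograd in practice).
    """
    weighted = grad_A * A                  # nabla A  (elementwise)  A
    positives = np.maximum(weighted, 0.0)  # clamp: keep positive evidence only
    return positives.mean(axis=0)          # expectation over heads
```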

$$R^{ss} = R^{ss} + \bar{A} \cdot R^{ss}$$
$$R^{sq} = R^{sq} + \bar{A} \cdot R^{sq}$$

The relevancy update for self-attention works as above, where $s$ denotes the query modality and $q$ the key modality.

Here, $R^{xx}$ can be separated into two parts: $I$, the initialization, and $\hat{R}^{xx} = R^{xx} - I$, the residual accumulated by the updates. Because $\hat{R}^{xx}$ is built from gradients, its values are small in magnitude relative to the identity part. To compensate, the rows of $\hat{R}^{xx}$ are normalized so that each sums to 1 (and the identity is added back), giving the normalized map $\bar{R}^{xx}$.
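A hedged NumPy sketch of the two self-attention update rules plus the row normalization (helper names are mine):

```python
import numpy as np

def update_self_attention(R_ss, R_sq, A_bar):
    """Self-attention update rules: the gradient-weighted attention
    A_bar redistributes the relevancy accumulated so far, for both
    the self map (s->s) and the cross map (s->q)."""
    R_ss = R_ss + A_bar @ R_ss
    R_sq = R_sq + A_bar @ R_sq
    return R_ss, R_sq

def normalize_self_relevancy(R_xx):
    """Split off the identity part, row-normalize the gradient-based
    residual so each row sums to 1, then add the identity back."""
    n = R_xx.shape[0]
    R_hat = R_xx - np.eye(n)
    row_sums = R_hat.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # guard all-zero rows
    return np.eye(n) + R_hat / row_sums
```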

For co-attention / cross-attention, where the queries come from modality $s$ and the keys from modality $q$, the update rules are

$$R^{sq} = R^{sq} + (\bar{R}^{ss})^{T} \cdot \bar{A} \cdot \bar{R}^{qq} \quad \text{(Eq. 10)}$$
$$R^{ss} = R^{ss} + \bar{A} \cdot R^{qs} \quad \text{(Eq. 11)}$$
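A sketch of the cross-attention rules in NumPy, assuming the normalized self-relevancies $\bar{R}^{ss}$ and $\bar{R}^{qq}$ have already been computed as just described (argument names are mine):

```python
import numpy as np

def update_cross_attention(R_sq, R_ss, R_qs, A_bar, Rn_ss, Rn_qq):
    """Cross-attention update rules. Queries come from modality s,
    keys from modality q; Rn_ss / Rn_qq are the row-normalized
    self-relevancies. A_bar has shape (s_tokens, q_tokens)."""
    R_sq = R_sq + Rn_ss.T @ A_bar @ Rn_qq  # Eq. 10
    R_ss = R_ss + A_bar @ R_qs             # Eq. 11
    return R_sq, R_ss
```

Note that Eq. 11 only matters once $R^{qs}$ is non-zero, i.e. once the other modality has already mixed in information from this one; this is why it drops out in the one-directional encoder-decoder case.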

Obtaining classification relevancies

The relevancy map corresponds to the rows of the [CLS] token, for example, the first row of $R^{tt}$ for text and the first row of $R^{ti}$ for images.
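In code this is just row slicing; a small sketch (assuming [CLS] is the first text token, as in BERT-style models):

```python
import numpy as np

def cls_relevancies(R_tt, R_ti, cls_index=0):
    """Read off the final explanation: the [CLS] row of each relevancy
    map gives per-token scores (text scores from R_tt, image scores
    from R_ti)."""
    return R_tt[cls_index], R_ti[cls_index]
```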

Adaptation to attention type

  • Single stream, where tokens from both modalities are concatenated and passed through self-attention: take the rows of the full $R^{(i+t,\,i+t)}$ corresponding to the [CLS] token to get a relevancy map over all $i+t$ tokens ($R^{i+t}$).
  • Two streams, where each modality first runs self-attention and the two then exchange information through co-attention: the full propagation described above is required; the relevancy map is then read off the same way as in a classification model.
  • Encoder-decoder structure: cross-attention runs in only one direction, so equation 11 contributes nothing and is not needed.

Result

[result figures]