
paper, code

TL;DR

  • I read this because.. : aka CheferCAM. I'm interested in explainable CLIP scores; the paper's repo publishes a Colab where you can visualize the results token by token.
  • task : explainability for neural networks
  • problem : the previous TiBA (https://github.com/long8v/PTIR/issues/158) handles only self-attention; this work also covers co-attention and encoder-decoder structures in multi-modal settings.
  • idea : use the gradient with respect to the attention map instead of the gradient of the previous layer's output (as in LRP)
  • input/output : model // heatmap over text or vision tokens
  • architecture : ViT, VisualBERT, LXMERT, DETR
  • baseline : rollout, raw attention, Grad-CAM, Partial LRP, TiBA
  • evaluation : perturbation tests (on both image and text tokens for VisualBERT), weakly-supervised / semantic segmentation
  • result : Better performance than its predecessor
  • contribution : extends explainability to cross-attention and co-attention architectures. ICCV oral
  • etc. : The deep Taylor decomposition in the predecessor was tiring, but if I skip that and just read this paper, no theoretical machinery is needed and it feels clean… on the other hand, the absence of theory makes it feel a bit ad hoc. In the case of CLIP, the final output is an embedding, so this doesn't seem to be a visualization of the CLIPScore itself…? Need to look at the Colab closely.

Details

some notation

  • $i$ denotes an image token
  • $t$ denotes a text token
  • $A^{tt}$ is the self-attention between text tokens / $A^{ii}$ is the self-attention between image tokens
  • $A^{ti}$ is the multimodal (cross-modal) attention interaction

Relevancy initialization

We’re going to initialize the relevancy maps and then update them layer by layer.

Before any self-attention, the tokens have not interacted with each other, so $R^{ii}$ and $R^{tt}$ are identity matrices, while $R^{it}$ and $R^{ti}$ are zero tensors.
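As a minimal NumPy sketch (function name and shapes are my own, not from the paper), the initialization looks like:

```python
import numpy as np

def init_relevancy(n_text, n_img):
    """Relevancy-map initialization: self-maps start as identity
    (each token only 'explains' itself); cross-modal maps start at
    zero (no interaction has happened yet)."""
    R_tt = np.eye(n_text)                 # text queries, text keys
    R_ii = np.eye(n_img)                  # image queries, image keys
    R_ti = np.zeros((n_text, n_img))      # text queries, image keys
    R_it = np.zeros((n_img, n_text))      # image queries, text keys
    return R_tt, R_ii, R_ti, R_it
```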

Relevancy update rules

We’ll update the relevancy maps using the attention map $A$: as in the predecessor, we weight it by its gradient and average across heads.

$$\bar{A} = \mathbb{E}_h\left[(\nabla A \odot A)^{+}\right]$$

where $\nabla A$ is the gradient of $y_t$, the output for the class $t$ we want to visualize, with respect to $A$. Only the positive values are kept before averaging (clamp); there is no deep justification for this, it simply follows the predecessor.
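A sketch of this step in NumPy, assuming the attention map and its gradient have already been pulled out of the model (the function name and shapes are mine):

```python
import numpy as np

def gradient_weighted_attention(A, grad_A):
    """Gradient-weighted attention relevance: element-wise product of
    the attention map with its gradient, clamped to the positives,
    then averaged across heads.

    A, grad_A: arrays of shape (heads, queries, keys); grad_A is
    d y_t / d A for the target class t (via autograd in practice).
    """
    weighted = grad_A * A                  # nabla A  (elementwise)  A
    positives = np.maximum(weighted, 0.0)  # clamp: keep positive evidence only
    return positives.mean(axis=0)          # expectation over heads
```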

$$R^{ss} = R^{ss} + \bar{A} \cdot R^{ss}$$
$$R^{sq} = R^{sq} + \bar{A} \cdot R^{sq}$$

The relevancy update for self-attention works as above, where $s$ denotes the query modality and $q$ the key modality.

Here, $R^{xx}$ can be separated into two parts: $I$, the initialization, and $\hat{R}^{xx} = R^{xx} - I$, the residual accumulated by the updates. Because $\hat{R}^{xx}$ is built from gradients, its values are small in magnitude relative to the identity part. To compensate, the rows of $\hat{R}^{xx}$ are normalized so that each sums to 1 (and the identity is added back), giving the normalized map $\bar{R}^{xx}$.
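A hedged NumPy sketch of the two self-attention update rules plus the row normalization (helper names are mine):

```python
import numpy as np

def update_self_attention(R_ss, R_sq, A_bar):
    """Self-attention update rules: the gradient-weighted attention
    A_bar redistributes the relevancy accumulated so far, for both
    the self map (s->s) and the cross map (s->q)."""
    R_ss = R_ss + A_bar @ R_ss
    R_sq = R_sq + A_bar @ R_sq
    return R_ss, R_sq

def normalize_self_relevancy(R_xx):
    """Split off the identity part, row-normalize the gradient-based
    residual so each row sums to 1, then add the identity back."""
    n = R_xx.shape[0]
    R_hat = R_xx - np.eye(n)
    row_sums = R_hat.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # guard all-zero rows
    return np.eye(n) + R_hat / row_sums
```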

For co-attention / cross-attention, where the queries come from modality $s$ and the keys from modality $q$, the update rules are

$$R^{sq} = R^{sq} + (\bar{R}^{ss})^{T} \cdot \bar{A} \cdot \bar{R}^{qq} \quad \text{(Eq. 10)}$$
$$R^{ss} = R^{ss} + \bar{A} \cdot R^{qs} \quad \text{(Eq. 11)}$$
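A sketch of the cross-attention rules in NumPy, assuming the normalized self-relevancies $\bar{R}^{ss}$ and $\bar{R}^{qq}$ have already been computed as just described (argument names are mine):

```python
import numpy as np

def update_cross_attention(R_sq, R_ss, R_qs, A_bar, Rn_ss, Rn_qq):
    """Cross-attention update rules. Queries come from modality s,
    keys from modality q; Rn_ss / Rn_qq are the row-normalized
    self-relevancies. A_bar has shape (s_tokens, q_tokens)."""
    R_sq = R_sq + Rn_ss.T @ A_bar @ Rn_qq  # Eq. 10
    R_ss = R_ss + A_bar @ R_qs             # Eq. 11
    return R_sq, R_ss
```

Note that Eq. 11 only matters once $R^{qs}$ is non-zero, i.e. once the other modality has already mixed in information from this one; this is why it drops out in the one-directional encoder-decoder case.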

Obtaining classification relevancies

The relevancy map corresponds to the rows of the [CLS] token, for example, the first row of $R^{tt}$ for text and the first row of $R^{ti}$ for images.
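In code this is just row slicing; a small sketch (assuming [CLS] is the first text token, as in BERT-style models):

```python
import numpy as np

def cls_relevancies(R_tt, R_ti, cls_index=0):
    """Read off the final explanation: the [CLS] row of each relevancy
    map gives per-token scores (text scores from R_tt, image scores
    from R_ti)."""
    return R_tt[cls_index], R_ti[cls_index]
```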

Adaptation to attention type

  • Single stream, where tokens from both modalities are concatenated and passed through self-attention: take the rows of the full $R^{(i+t,\,i+t)}$ corresponding to the [CLS] token to get a relevancy map over all $i+t$ tokens ($R^{i+t}$).
  • Two streams, where each modality first runs self-attention and the two then exchange information through co-attention: the full propagation described above is required; the relevancy map is then read off the same way as in a classification model.
  • Encoder-decoder structure: cross-attention runs in only one direction, so equation 11 contributes nothing and is not needed.

Result

[result figures]