TL;DR
- why I read it : I follow Chefer on Google Scholar, so I get alert emails for new papers (so convenient!)
- task : explainability in CLIP
- problem : explanatory power in CLIP models
- IDEA : take the derivative with respect to the hidden representations of all layers and use it to compute explainability maps
- input/output : {image, text} -> layer explainability maps
- architecture : CLIP ViT-B/16, -L/14, -H/14, -BigG/14, SigLIP
- baseline : LRP, Partial-LRP, rollout, Raw attention, GradCAM, CheferCAM, TextSpan
- data : ImageNet-S, OpenImage V7, ImageNet(perturbation)
- evaluation : segmentation (pixel acc, mIoU, mAP), open-vocabulary segmentation (p-mIoU), perturbation test (neg./pos. accuracy)
- result : SOTA.
- contribution : a method that makes good use of all layers. Depending on model scale, different layers carry the explanatory signal.
- I don't know why "sensitivity" is in the title; maybe because I'm only skimming?
Details
Methodology
The final output of ViT looks like this
where $\bar{z}$ is the pooled representation (CLS pooling or attention pooling)
The activation (score) for our target class $c$ is denoted $s$.
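In my own notation (a reconstruction from the description above, not the paper's exact formula), assuming CLIP scores the class by an inner product with the text embedding $t_c$:

$$
\bar{z} = \mathrm{Pool}\left(z^{L}\right), \qquad s = \left\langle \bar{z},\, t_c \right\rangle
$$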
Let $A$ be the corresponding attention map; differentiating $s$ with respect to it gives the following map
Apply ReLU, then average over heads and patches ($n$) within each layer.
Drop the CLS token, reshape back to the spatial grid, and apply min-max normalization.
Compute $s^{l}$ for every layer in the same way, and build each layer's map from its derivative.
The per-layer maps are then combined by summation, not matrix multiplication.
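The steps above (ReLU → average over heads/patches → drop CLS → min-max normalize → combine layers) can be sketched as follows. This is a minimal sketch in my own notation, assuming the gradients $\partial s^{l} / \partial A^{l}$ have already been computed (e.g. via autograd); the function name and tensor shapes are mine, not the paper's.

```python
import numpy as np

def legrad_style_map(grads):
    """Combine per-layer attention-gradient tensors into one explainability map.

    grads: list of arrays, one per layer, each of shape (heads, n, n),
    assumed to hold d s^l / d A^l for that layer (hypothetical inputs;
    the real method would obtain them via autograd).
    Token 0 is taken to be the CLS token.
    """
    layer_maps = []
    for g in grads:
        m = np.maximum(g, 0.0)        # ReLU on the gradients
        m = m.mean(axis=(0, 1))       # average over heads and query tokens -> (n,)
        m = m[1:]                     # drop the CLS token
        m = (m - m.min()) / (m.max() - m.min() + 1e-8)  # min-max normalize
        layer_maps.append(m)
    # per-layer maps are summed/averaged, not matrix-multiplied
    return np.mean(layer_maps, axis=0)
```

In a real pipeline the resulting vector would be reshaped to the patch grid and upsampled to image size.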
What’s different about Chefer 21? https://github.com/long8v/PTIR/issues/159#issuecomment-1933470637
- The gradient is taken not on the final output, but on the inner product with each layer's representation! (Biggest difference)
- `image_relevance = R[:, 0, 1:]` : LeGrad drops the CLS token only at the end, whereas Chefer uses the CLS token's relevance directly. For the $n \times n$ map, LeGrad takes a summation over all rows, while Chefer takes only the row corresponding to CLS.
- Chefer initializes the relevance with an identity matrix to model the residual connection; LeGrad has no such step.
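The CLS-handling difference can be made concrete with a toy example. This is only an illustration of my reading of the two indexing conventions, with made-up shapes; it is not code from either repo.

```python
import numpy as np

n = 5                        # tokens, with CLS at index 0
R = np.random.rand(1, n, n)  # toy relevance/attention map (batch of 1)

# Chefer-style: take the CLS row, i.e. relevance of patches to the CLS token
chefer_map = R[:, 0, 1:]            # shape (1, n-1)

# LeGrad-style (as I read it): sum over all rows first, drop CLS at the end
legrad_map = R.sum(axis=1)[:, 1:]   # shape (1, n-1)
```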
Result
Perturbation result
Layer Ablation
For small models, using only the last few layers worked well, but as the model gets bigger, more layers should be used.
ReLU ablation
Having ReLU on was better, though turning it off doesn't hurt much.