TL;DR
- why I read it : I follow Chefer on Google Scholar, so I get alert emails for new papers (so convenient!)
- task : explainability in CLIP
- problem : explanatory power in CLIP models
- IDEA : take the derivative with respect to the hidden representations of all layers and use it to compute explainability maps
- input/output : {image, text} -> layer explainability maps
- architecture : CLIP ViT-B/16, -L/14, -H/14, -BigG/14, SigLIP
- baseline : LRP, Partial-LRP, rollout, Raw attention, GradCAM, CheferCAM, TextSpan
- data : ImageNet-S, OpenImage V7, ImageNet(perturbation)
- evaluation : segmentation (pixel acc, mIoU, mAP), open-vocabulary segmentation (p-mIoU), perturbation test (neg./pos. accuracy)
- result : SOTA.
- contribution : a method that makes good use of all layers. Depending on model scale, different layers carry the explanatory signal.
- I don't know why "sensitivity" is in the title; maybe because I'm only skimming?
Details
Methodology
The final output of ViT looks like this
where $\bar{z}$ is the pooled representation (CLS pooling or attention pooling)
The activation (score) for our target class $c$ is denoted $s$.
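In my own notation (a reconstruction from the description above, not the paper's exact formula), assuming CLIP scores the class by an inner product with the text embedding $t_c$:

$$
\bar{z} = \mathrm{Pool}\left(z^{L}\right), \qquad s = \left\langle \bar{z},\, t_c \right\rangle
$$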
Let $A$ be the corresponding attention map; differentiating $s$ with respect to it gives the following map
Apply ReLU, then average over heads and patches ($n$) within each layer.
Drop the CLS token, reshape back to the spatial grid, and apply min-max normalization.
Compute $s^{l}$ for every layer in the same way, and build each layer's map from its derivative.
The per-layer maps are then combined by summation, not matrix multiplication.
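The steps above (ReLU → average over heads/patches → drop CLS → min-max normalize → combine layers) can be sketched as follows. This is a minimal sketch in my own notation, assuming the gradients $\partial s^{l} / \partial A^{l}$ have already been computed (e.g. via autograd); the function name and tensor shapes are mine, not the paper's.

```python
import numpy as np

def legrad_style_map(grads):
    """Combine per-layer attention-gradient tensors into one explainability map.

    grads: list of arrays, one per layer, each of shape (heads, n, n),
    assumed to hold d s^l / d A^l for that layer (hypothetical inputs;
    the real method would obtain them via autograd).
    Token 0 is taken to be the CLS token.
    """
    layer_maps = []
    for g in grads:
        m = np.maximum(g, 0.0)        # ReLU on the gradients
        m = m.mean(axis=(0, 1))       # average over heads and query tokens -> (n,)
        m = m[1:]                     # drop the CLS token
        m = (m - m.min()) / (m.max() - m.min() + 1e-8)  # min-max normalize
        layer_maps.append(m)
    # per-layer maps are summed/averaged, not matrix-multiplied
    return np.mean(layer_maps, axis=0)
```

In a real pipeline the resulting vector would be reshaped to the patch grid and upsampled to image size.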
What’s different about Chefer 21? https://github.com/long8v/PTIR/issues/159#issuecomment-1933470637
- The gradient is taken not on the final output, but on the inner product with each layer's representation! (Biggest difference)
- `image_relevance = R[:, 0, 1:]` : LeGrad drops the CLS token only at the end, whereas Chefer uses the CLS token's relevance directly. For the $n \times n$ map, LeGrad takes a summation over all rows, while Chefer takes only the row corresponding to CLS.
- Chefer initializes the relevance with an identity matrix to model the residual connection; LeGrad has no such step.
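The CLS-handling difference can be made concrete with a toy example. This is only an illustration of my reading of the two indexing conventions, with made-up shapes; it is not code from either repo.

```python
import numpy as np

n = 5                        # tokens, with CLS at index 0
R = np.random.rand(1, n, n)  # toy relevance/attention map (batch of 1)

# Chefer-style: take the CLS row, i.e. relevance of patches to the CLS token
chefer_map = R[:, 0, 1:]            # shape (1, n-1)

# LeGrad-style (as I read it): sum over all rows first, drop CLS at the end
legrad_map = R.sum(axis=1)[:, 1:]   # shape (1, n-1)
```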
Result
Perturbation result
Layer Ablation
For small models, using only the last few layers worked well, but as the model gets bigger, more layers should be used.
ReLU ablation
Having ReLU on was better, though turning it off doesn't hurt much.