TL;DR
- I read this because : came across it while searching for work on CLIP's spurious cues
- task : extract a text description of each layer and head in CLIP's ViT
- idea : create 3948 common-expression sentences with humans + GPT, then repeatedly pick the text direction with the highest variance over the image representations, add it to the extracted set, and project it out before the next pick
- input/output : {image, model} -> text explanation of ViT layer and heads
- architecture : ViT-B-16, ViT-L-14, ViT-H-14
- baseline : LRP, Partial-LRP, rollout, raw attention, GradCAM, Chefer2021
- data : ImageNet (mean ablation), Waterbirds (reducing spurious cues), ImageNet-Segmentation (zs-segmentation)
- evaluation : accuracy (ImageNet), worst-group accuracy (Waterbirds), pixel accuracy/mIoU/mAP (zs-segmentation)
- result : only the MSAs in the last 4 layers affect the final prediction, the other layers have little effect; qualitatively very interesting results, and SOTA on zs-segmentation
- contribution : Proposed an algorithm to describe each representation of CLIP as text.
- etc. :
Details
related work
- Multimodal neurons in artificial neural networks https://openai.com/index/multimodal-neurons
- Argues that the representations CLIP learns per layer and per head are highly interpretable
- Disentangling visual and written concepts in CLIP
- A paper that uses the above methodology to disentangle written text from visual concepts in image representations, and to erase the written ones
Preliminary findings
Only the MSAs in the last 4 layers affected performance, and mean ablating the MLP or earlier MSA layers did not have a significant impact on performance.
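The mean-ablation probe can be sketched in a few lines, assuming we already have each block's direct contribution to the residual stream (function and variable names here are mine, not the paper's):

```python
import numpy as np

def mean_ablate_contribution(rep, contrib):
    """Replace one block's per-image contribution with its dataset mean.

    rep:     (num_images, d) final image representations
    contrib: (num_images, d) the direct contribution of one block
             (an MSA layer or MLP) to each representation
    If accuracy barely drops after ablation, that block's image-specific
    information matters little for the final prediction.
    """
    return rep - contrib + contrib.mean(axis=0, keepdims=True)
```

Running the classifier on the ablated representations, block by block, is what localizes the effect to the last 4 MSA layers.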
Decomposition to head
The class-token output of an MSA layer can be decomposed over heads and tokens ($\alpha$ is the attention score; equation reconstructed from the paper's notation):

$$\mathrm{MSA}^{l}(Z^{l-1})_{\text{cls}}=\sum_{h=1}^{H}\sum_{i=0}^{N}\alpha_{i}^{l,h}\,W_{VO}^{l,h}\,z_{i}^{l-1}$$
Including CLIP's output projection $P$ gives the per-term contributions

$$c_{i,l,h}=\alpha_{i}^{l,h}\,P\,W_{VO}^{l,h}\,z_{i}^{l-1},$$

so summing the $c_{i,l,h}$ over the appropriate indices yields a representation for each layer, head, or patch.
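A NumPy sketch of that decomposition (shapes and names are my assumptions; the paper folds the value and output matrices into a single $W_{VO}$ per head):

```python
import numpy as np

def per_head_contributions(Z, W_V, W_O, A, P):
    """Per-(head, token) contributions of one MSA layer to the class token.

    Z:   (N+1, d)    tokens entering the MSA layer
    W_V: (H, d, dh)  value projection per head
    W_O: (H, dh, d)  per-head slice of the output projection
    A:   (H, N+1)    attention weights alpha from the class token to each token
    P:   (d, d_out)  CLIP's final projection into the joint text-image space
    Returns c of shape (H, N+1, d_out); summing c over heads and tokens
    recovers this layer's direct contribution to the image representation.
    """
    H = A.shape[0]
    return np.stack([
        (A[h][:, None] * (Z @ W_V[h] @ W_O[h])) @ P  # alpha_i * P W_VO z_i
        for h in range(H)
    ])
```

Grouping the sum by head (or by patch) gives exactly the per-head (or per-patch) representations the paper analyzes.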
TextSpan algorithm
It looks complicated, but it’s not
- For each layer/head, matrix-multiply the head outputs $C\in\mathbb{R}^{K\times d'}$ with the text representations $R\in\mathbb{R}^{M\times d'}$, find the text direction $j$ with the highest variance over images, and add it to the extracted set. Then project that direction out of both $C$ and $R$ so the next pick is orthogonal to it (similar to PCA), and repeat.
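A minimal sketch of that greedy loop (my reconstruction, not the authors' code; `m` is the number of texts to extract per head):

```python
import numpy as np

def textspan(C, R, m):
    """Greedy TextSpan sketch.

    C: (K, d') outputs of one head over K images
    R: (M, d') text embeddings of the candidate sentence pool
    m: number of text directions to extract
    Returns indices into the sentence pool, most-explanatory first.
    """
    C = C - C.mean(axis=0, keepdims=True)   # center over images
    R = R.astype(float).copy()
    selected = []
    for _ in range(m):
        proj = C @ R.T                      # (K, M) image-text projections
        j = int(np.argmax(proj.var(axis=0)))  # highest-variance text direction
        selected.append(j)
        r = R[j] / np.linalg.norm(R[j])
        # deflate: remove the chosen direction from both C and R (PCA-like)
        C = C - np.outer(C @ r, r)
        R = R - np.outer(R @ r, r)
    return selected
```

Because each chosen direction is projected out, a sentence's variance collapses to zero once selected, so the same text is not picked twice.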
These are the text descriptions recovered for each layer/head (see the paper's figures)