image

a.k.a. TextSpan (paper, code)

TL;DR

  • I read this because : came across it while searching for work on CLIP spurious cues
  • task : extract a text description of each layer and head in CLIP's ViT
  • idea : build 3,948 common image-description sentences with humans + GPT, then repeatedly pick the text whose direction has the highest variance over the image representations, add it to the projection basis, and orthogonalize before the next pick (PCA-style)
  • input/output : {image, model} -> text explanation of ViT layer and heads
  • architecture : ViT-B-16, ViT-L-14, ViT-H-14
  • baseline : LRP, Partial-LRP, rollout, raw attention, GradCAM, Chefer2021
  • data : ImageNet (mean ablation), Waterbirds (reducing spurious cues), ImageNet-Segmentation (zero-shot segmentation)
  • evaluation : accuracy (ImageNet), worst-group accuracy (Waterbirds), pixel accuracy/mIoU/mAP (zero-shot segmentation)
  • result : only the MSAs in the last 4 layers affect the final prediction; the other layers have little effect. Qualitatively very interesting results; SOTA on zero-shot segmentation
  • contribution : an algorithm that describes each of CLIP's internal representations (per layer and head) as text
  • etc. :

Details

  • Multimodal neurons in artificial neural networks https://openai.com/index/multimodal-neurons
  • Argues that the representations CLIP learns, layer by layer and head by head, are highly interpretable
  • Disentangling visual and written concepts in CLIP
  • A paper that uses the methodology above to add and erase written-text concepts in image representations

Preliminary findings

image

Only the MSAs in the last 4 layers affect performance: mean-ablating the MLPs or the earlier MSA layers has no significant impact.
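A minimal sketch of what mean ablation means here (the function name and numpy setup are illustrative, not the paper's code): replace a component's per-image output with its mean over the dataset, destroying image-specific information while keeping the average signal.

```python
import numpy as np

def mean_ablate(contribs):
    """contribs: (N, d) array of one component's outputs (e.g. an MSA
    layer) over N images. Returns the ablated outputs: every image's
    contribution is replaced by the dataset-wide mean contribution."""
    mean = contribs.mean(axis=0, keepdims=True)          # (1, d)
    return np.broadcast_to(mean, contribs.shape).copy()  # (N, d)
```

If accuracy barely drops when a component is ablated this way, that component carries little image-specific information for the prediction.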

Decomposition to head

image

The MSA output can be expressed as above, where $\alpha$ is the attention weight.

image

If we include the projection $P$, we get the expression above. The final representation is then a sum of terms $c_{i,j,h}$, so we can read off a separate contribution for each layer, head, and patch.
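As a sketch of what the two equations above say (reconstructed from the paper's setup, so the exact notation may differ): the MSA output of layer $l$ is a sum over heads $h$ and tokens $i$,

$$\mathrm{MSA}^{l}(Z^{l}) = \sum_{h=1}^{H}\sum_{i=0}^{N} \alpha^{l,h}_{i}\, W^{l,h}_{VO}\, z^{l}_{i},$$

and folding in the output projection $P$ makes each term a direct, linear contribution to the final image representation:

$$c_{i,l,h} = \alpha^{l,h}_{i}\, P\, W^{l,h}_{VO}\, z^{l}_{i}.$$

Because the map is linear, these contributions can be summed over any subset (a head, a layer, a patch) to get that subset's representation.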

TextSpan algorithm

image

It looks complicated, but it isn't:

  • For each layer and head, matrix-multiply the attention outputs $C\in\mathbb{R}^{K\times d'}$ with the text representations $R\in\mathbb{R}^{M\times d'}$, find the text $j$ whose projection has the highest variance, and add that direction $\tau$ to the basis. Then update $C$ and $R$ to be orthogonal to the chosen direction and repeat (similar to PCA).
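The bullet above can be sketched in a few lines of numpy. This is a hedged reconstruction, not the authors' code: `text_span`, its greedy loop, and the normalization choices are my assumptions about the algorithm as described.

```python
import numpy as np

def text_span(C, R, m=5):
    """Greedy TextSpan sketch for one layer/head.
    C: (K, d') image-side attention outputs; R: (M, d') text embeddings.
    Returns the indices of m texts whose directions explain the most
    variance of C, orthogonalizing after each pick (PCA-style)."""
    C = C - C.mean(axis=0)                 # center image representations
    R = R.astype(float).copy()
    chosen = []
    for _ in range(m):
        proj = C @ R.T                     # (K, M): each image on each text
        j = int(np.argmax(proj.var(axis=0)))   # text with highest variance
        chosen.append(j)
        d = R[j] / np.linalg.norm(R[j])    # unit vector of chosen direction
        C = C - np.outer(C @ d, d)         # remove that direction from C...
        R = R - np.outer(R @ d, d)         # ...and from the remaining texts
    return chosen
```

The orthogonalization step is what makes later picks describe *new* variance rather than restating the first direction, mirroring how PCA components are mutually orthogonal.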

These are the text descriptions found for each layer/head:

image

Result

Quantitative

image

Qualitative

image image image