image

a.k.a. TextSpan (paper, code)

TL;DR

  • I read this because : came across it while searching for work on CLIP spurious cues
  • task : extract a text description of each layer and head in CLIP's ViT
  • idea : build 3,948 common image-description sentences with humans + GPT, then repeatedly pick the text whose direction has the highest variance over the image representations, add it to the projection basis, and orthogonalize before the next pick (PCA-style)
  • input/output : {image, model} -> text explanation of ViT layer and heads
  • architecture : ViT-B-16, ViT-L-14, ViT-H-14
  • baseline : LRP, Partial-LRP, rollout, raw attention, GradCAM, Chefer2021
  • data : ImageNet (mean ablation), Waterbirds (reducing spurious cues), ImageNet-Segmentation (zero-shot segmentation)
  • evaluation : accuracy (ImageNet), worst-group accuracy (Waterbirds), pixel accuracy/mIoU/mAP (zero-shot segmentation)
  • result : only the MSAs in the last 4 layers affect the final prediction; the other layers have little effect. Qualitatively very interesting results; SOTA on zero-shot segmentation
  • contribution : an algorithm that describes each of CLIP's internal representations (per layer and head) as text
  • etc. :

Details

  • Multimodal neurons in artificial neural networks https://openai.com/index/multimodal-neurons
  • Argues that the representations CLIP learns, layer by layer and head by head, are highly interpretable
  • Disentangling visual and written concepts in CLIP
  • A paper that uses the methodology above to add and erase written-text concepts in image representations

Preliminary findings

image

Only the MSAs in the last 4 layers affect performance: mean-ablating the MLPs or the earlier MSA layers has no significant impact.
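A minimal sketch of what mean ablation means here (the function name and numpy setup are illustrative, not the paper's code): replace a component's per-image output with its mean over the dataset, destroying image-specific information while keeping the average signal.

```python
import numpy as np

def mean_ablate(contribs):
    """contribs: (N, d) array of one component's outputs (e.g. an MSA
    layer) over N images. Returns the ablated outputs: every image's
    contribution is replaced by the dataset-wide mean contribution."""
    mean = contribs.mean(axis=0, keepdims=True)          # (1, d)
    return np.broadcast_to(mean, contribs.shape).copy()  # (N, d)
```

If accuracy barely drops when a component is ablated this way, that component carries little image-specific information for the prediction.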

Decomposition to head

image

The MSA output can be expressed as above, where $\alpha$ is the attention weight.

image

If we include the projection $P$, we get the expression above. The final representation is then a sum of terms $c_{i,j,h}$, so we can read off a separate contribution for each layer, head, and patch.
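As a sketch of what the two equations above say (reconstructed from the paper's setup, so the exact notation may differ): the MSA output of layer $l$ is a sum over heads $h$ and tokens $i$,

$$\mathrm{MSA}^{l}(Z^{l}) = \sum_{h=1}^{H}\sum_{i=0}^{N} \alpha^{l,h}_{i}\, W^{l,h}_{VO}\, z^{l}_{i},$$

and folding in the output projection $P$ makes each term a direct, linear contribution to the final image representation:

$$c_{i,l,h} = \alpha^{l,h}_{i}\, P\, W^{l,h}_{VO}\, z^{l}_{i}.$$

Because the map is linear, these contributions can be summed over any subset (a head, a layer, a patch) to get that subset's representation.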

TextSpan algorithm

image

It looks complicated, but it isn't:

  • For each layer and head, matrix-multiply the attention outputs $C\in\mathbb{R}^{K\times d'}$ with the text representations $R\in\mathbb{R}^{M\times d'}$, find the text $j$ whose projection has the highest variance, and add that direction $\tau$ to the basis. Then update $C$ and $R$ to be orthogonal to the chosen direction and repeat (similar to PCA).
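The bullet above can be sketched in a few lines of numpy. This is a hedged reconstruction, not the authors' code: `text_span`, its greedy loop, and the normalization choices are my assumptions about the algorithm as described.

```python
import numpy as np

def text_span(C, R, m=5):
    """Greedy TextSpan sketch for one layer/head.
    C: (K, d') image-side attention outputs; R: (M, d') text embeddings.
    Returns the indices of m texts whose directions explain the most
    variance of C, orthogonalizing after each pick (PCA-style)."""
    C = C - C.mean(axis=0)                 # center image representations
    R = R.astype(float).copy()
    chosen = []
    for _ in range(m):
        proj = C @ R.T                     # (K, M): each image on each text
        j = int(np.argmax(proj.var(axis=0)))   # text with highest variance
        chosen.append(j)
        d = R[j] / np.linalg.norm(R[j])    # unit vector of chosen direction
        C = C - np.outer(C @ d, d)         # remove that direction from C...
        R = R - np.outer(R @ d, d)         # ...and from the remaining texts
    return chosen
```

The orthogonalization step is what makes later picks describe *new* variance rather than restating the first direction, mirroring how PCA components are mutually orthogonal.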

These are the text descriptions found for each layer/head:

image

Result

Quantitative

image

Qualitative

image image image