[157] LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity

TL;DR

I read this because.. : I follow Chefer on Google Scholar, so it emailed me this paper (really convenient!)
task : explainability in CLIP
problem : explainability of CLIP models
idea : build a gradient-based map from the hidden representation of every layer, then sum the maps over layers
input/output : {image, text} -> layer explainability maps
architecture : CLIP ViT-B/16, -L/14, -H/14, -BigG/14, SigLIP
baseline : LRP, Partial-LRP, rollout, Raw attention, GradCAM, CheferCAM, TextSpan
data : ImageNet-S, OpenImages V7, ImageNet (perturbation)
evaluation : segmentation (pixel acc, mIoU, mAP), open-vocabulary segmentation (p-mIoU), perturbation test (neg, pos accuracy)
result : SOTA.
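contribution : a method that makes good use of every layer. Shows that behavior differs across model scales and across layers.
etc. : maybe because I only skimmed it, but I don't see why "sensitivity" is in the title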

Details

Methodology

The paper starts by writing down the final output of the ViT.

Here $\bar{z}$ is the pooled representation (CLS pooling or attention pooling). The activation for our target class $c$ is called $s$. With $A$ denoting the attention map, $s$ is differentiated with respect to the attention map to build an explainability map.
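As a rough sketch of the notation only (the symbol $w_c$ and the tensor shape are assumptions made here for illustration, not copied from the paper): if $w_c$ is the embedding of the target class $c$ (for CLIP, presumably the text embedding of the prompt), $h$ is the number of heads, and $n$ is the number of tokens, the score and its gradient could be written as

$$
s = \bar{z}^{\top} w_c, \qquad \nabla A = \frac{\partial s}{\partial A} \in \mathbb{R}^{h \times n \times n}
$$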

ReLU is applied, then the result is averaged over layer / head / patch ($n$).

The CLS token is dropped, the result is reshaped back into a 2D grid, and min-max normalization is applied.
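The post-processing for one layer can be sketched in NumPy as below. This is a minimal sketch under assumptions, not the authors' code: the function name `single_layer_map`, the `(heads, tokens, tokens)` layout with the CLS token at index 0, and the small epsilon in the normalization are all hypothetical choices, and the gradient is a random stand-in for what autograd would return.

```python
import numpy as np

def single_layer_map(grad_attn, grid_hw):
    """Turn d(s)/d(A) for one layer into a 2D heatmap.

    grad_attn: (heads, tokens, tokens) gradient of the score s w.r.t. the
               attention map; token 0 is assumed to be CLS.
    grid_hw:   (height, width) of the patch grid.
    """
    positive = np.maximum(grad_attn, 0.0)    # ReLU: keep positive gradients only
    per_token = positive.mean(axis=(0, 1))   # average over heads and query rows
    patches = per_token[1:]                  # drop the CLS token
    heat = patches.reshape(grid_hw)          # back to the 2D patch grid
    lo, hi = heat.min(), heat.max()
    return (heat - lo) / (hi - lo + 1e-8)    # min-max normalization

# Random stand-in gradient shaped like ViT-B/16 at 224x224:
# 12 heads, 14 * 14 patches + 1 CLS token.
rng = np.random.default_rng(0)
grad = rng.standard_normal((12, 197, 197))
heat = single_layer_map(grad, (14, 14))
print(heat.shape)
```

In a real pipeline the gradient would come from backpropagating $s$ through the model; only the steps after that are shown here.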

In the same way, $s^{l}$ is computed for every layer, and a map is built from its gradient.

These per-layer maps are combined by summation, not by matrix multiplication.
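A minimal sketch of the layer aggregation, assuming the per-layer maps are summed before a single final min-max normalization (whether normalization happens before or after the sum is an assumption here, and `aggregate_layers` is a hypothetical name):

```python
import numpy as np

def aggregate_layers(per_layer_maps):
    """Combine per-layer maps by element-wise summation.

    per_layer_maps: list of (H, W) arrays, one unnormalized map per layer.
    """
    stacked = np.stack(per_layer_maps)       # (layers, H, W)
    summed = stacked.sum(axis=0)             # sum over layers, no matrix product
    lo, hi = summed.min(), summed.max()
    return (summed - lo) / (hi - lo + 1e-8)  # min-max normalization

# Two toy 2x2 "layer maps".
maps = [
    np.array([[0.0, 1.0], [2.0, 3.0]]),
    np.array([[4.0, 0.0], [0.0, 1.0]]),
]
combined = aggregate_layers(maps)
print(combined)
```

Because the combination is a plain sum, each layer contributes independently, so using only the last few layers amounts to passing a shorter list.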

How is this different from Chefer 21? https://github.com/long8v/PTIR/issues/159#issuecomment-1933470637

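Instead of differentiating the final output, it takes the dot product with each layer's own representation and differentiates that! (the most important difference)
image_relevance = R[:, 0, 1:] : LeGrad instead drops the CLS token at the very end, whereas Chefer uses the entries attached to the CLS token. On the $n \times n$ map, LeGrad aggregates over all rows at the end, while Chefer uses only the CLS row.
Chefer initializes the relevance with an identity matrix to represent the residual connection; LeGrad has nothing like that.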
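The differences listed above can be put side by side in a small NumPy sketch. Both functions are simplified illustrations written for this note, not either paper's released code: the names `chefer_style` and `legrad_style` are hypothetical, the batch dimension is omitted (so the CLS row is `R[0, 1:]` rather than `R[:, 0, 1:]`), rows are averaged rather than summed, and the gradients are random stand-ins. In LeGrad the gradient at layer $l$ would come from that layer's own score $s^{l}$, which the sketch simply takes as input.

```python
import numpy as np

def chefer_style(attns, grads):
    """Rollout-like update: identity init, matrix products, CLS row at the end."""
    n = attns[0].shape[-1]
    R = np.eye(n)                                  # identity models the residual path
    for A, G in zip(attns, grads):
        cam = np.maximum(G * A, 0.0).mean(axis=0)  # (n, n), averaged over heads
        R = R + cam @ R                            # layers combined by matrix product
    return R[0, 1:]                                # CLS row, patch columns only

def legrad_style(grads):
    """Per-layer gradient maps, aggregated over all rows, summed over layers."""
    total = np.zeros(grads[0].shape[-1])
    for G in grads:
        total += np.maximum(G, 0.0).mean(axis=(0, 1))  # over heads and all rows
    return total[1:]                               # drop CLS only at the end

# Toy input: 3 layers, 4 heads, 5 tokens (1 CLS + 4 patches).
rng = np.random.default_rng(0)
attns = [rng.random((4, 5, 5)) for _ in range(3)]
grads = [rng.standard_normal((4, 5, 5)) for _ in range(3)]
chefer_map = chefer_style(attns, grads)
legrad_map = legrad_style(grads)
print(chefer_map.shape, legrad_map.shape)
```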

Result

Perturbation result

Layer Ablation

For small models, using only the last few layers worked best, but as the model gets larger, more layers have to be used.

ReLU ablation

Keeping the ReLU on was better, but turning it off does not hurt much.
