TL;DR
- I read this because : I thought I should know this paper well enough to explain it.
- task : explainability in CNN
- problem : How do we attach an interpretable module that can be applied to any kind of CNN?
- idea : Differentiate the score $y^c$ of the class we want to visualize with respect to the activation maps $A^k$ of a conv layer, global-average-pool the gradients to get importance weights, then take the weighted sum of the $A^k$ and apply ReLU.
- input/output : {image, class or caption or answer} -> activation map
- architecture : VGG-16, AlexNet, GoogLeNet
- objective : X
- baseline : CAM, Guided-BackProp, c-MWP
- data : ILSVRC-15, PASCAL VOC 2007
- evaluation : WSSS (weakly-supervised semantic segmentation), human evaluation, pointing game
- result : Good discriminative power with no performance degradation (CAM degrades accuracy). Good seeds for WSSS. Good visualization of adversarial samples. Human studies: show people the highlighted regions and have them classify (trust), and show Guided Backprop vs. Deconv visualizations and ask which is better.
- contribution : Simple idea, de-facto method with no performance penalty
- etc. : This is where the convention of ignoring negative gradients comes from. Read Guided Backprop and Network Dissection. Note the term “counterfactual explanation”.
Details
proposed
Differentiate the logit $y^c$ (before softmax) for the class $c$ we want to visualize with respect to the activation map $A^k_{ij}$.
Global-average-pool the gradients over width and height $(i, j)$ to get the importance weight $\alpha^c_k = \frac{1}{Z}\sum_i\sum_j \frac{\partial y^c}{\partial A^k_{ij}}$.
The weighted sum of the activation maps is then passed through ReLU, giving the Grad-CAM map $L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\left(\sum_k \alpha^c_k A^k\right)$.
Use the conv feature map from the last layer (14 × 14 for VGG-16); using earlier layers doesn't perform as well. The reason for the ReLU is that pixels with negative influence likely belong to other categories: without ReLU, classes other than the desired $y^c$ were sometimes highlighted and localization performance dropped.
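The gradient → GAP → weighted sum → ReLU pipeline above can be sketched in PyTorch. This is a minimal illustration, not the authors' code: the toy CNN and all names (`target_layer`, `acts`, `grads`) are stand-ins, with a strided conv playing the role of VGG-16's last conv layer.

```python
import torch
import torch.nn.functional as F

# Toy CNN as a stand-in for VGG-16; `target_layer` plays the role of the last conv layer.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, 3, stride=2, padding=1), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
    torch.nn.Linear(8, 10),
).eval()
target_layer = model[0]

# Hooks to capture the activation map A^k and its gradient d y^c / d A^k
acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

img = torch.randn(1, 3, 32, 32)           # stand-in input image
logits = model(img)                       # class scores before softmax
c = logits.argmax(dim=1).item()           # class y^c to visualize
logits[0, c].backward()                   # backprop gives d y^c / d A^k

alpha = grads["g"].mean(dim=(2, 3), keepdim=True)   # GAP over (i, j): importance weights
cam = F.relu((alpha * acts["a"]).sum(dim=1))        # weighted sum over k, then ReLU
cam = F.interpolate(cam[None], size=img.shape[2:], mode="bilinear")[0].detach()
```

The final interpolation upsamples the coarse map back to the input resolution for overlaying on the image.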
guided grad-cam A 14 × 14 feature map can tell us where the model is looking, but it doesn't give a fine-grained explanation of why it's a “tiger cat”. So Guided Backpropagation (Striving for Simplicity: The All Convolutional Net, https://arxiv.org/abs/1412.6806) is multiplied elementwise with the (upsampled) Grad-CAM map for the visualization. Deconv can be used instead, but experimentally Guided Backprop is better. The Guided Backprop paper says “negative gradients are suppressed” — let's read what that means.
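"Negative gradients are suppressed" can be illustrated with a custom autograd function — a hedged sketch, not the paper's implementation: in the backward pass of every ReLU, the gradient is zeroed wherever the forward input was negative (standard ReLU backward) *and* wherever the incoming gradient itself is negative (the guided part).

```python
import torch

class GuidedReLU(torch.autograd.Function):
    """ReLU whose backward pass suppresses negative gradients (Guided Backprop)."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # pass gradient only where the forward input was positive AND the
        # incoming gradient is positive
        return grad_out * (x > 0) * (grad_out > 0)

x = torch.tensor([[-1.0, 2.0], [3.0, -4.0]], requires_grad=True)
w = torch.tensor([[1.0, -1.0], [1.0, 1.0]])     # gives mixed-sign upstream gradients
GuidedReLU.apply(x).mul(w).sum().backward()
# x.grad is nonzero only at (1, 0): x > 0 there and the upstream gradient is positive
```

Guided Grad-CAM is then just the elementwise product of this input-space gradient map with the upsampled Grad-CAM map.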
counterfactual explanation
Simply negating the gradient before pooling and then taking the ReLU (which leaves only the negatively contributing regions) gives a counterfactual explanation: an explanation of why this region is not of this class!
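In code this is a one-line change to the Grad-CAM pipeline — a hedged sketch with hand-made stand-in tensors, not the authors' code: flip the gradient sign before the GAP step, so the map highlights regions that would *decrease* the class score.

```python
import torch

grads = torch.tensor([[[[ 0.5, -1.0],
                        [-0.3,  0.2]]]])        # stand-in d y^c / d A^k (one channel)
acts  = torch.ones_like(grads)                  # stand-in activation map A^k

alpha = (-grads).mean(dim=(2, 3), keepdim=True) # negated gradients -> importance
counterfactual = torch.relu((alpha * acts).sum(dim=1))
```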
Result
classification result
result on captioning model
textual explanation on neuron
Network Dissection: Quantifying Interpretability of Deep Visual Representations https://arxiv.org/abs/1704.05796 Read this
result with adversarial noise
An example where a slight perturbation to the image makes it predicted as “airliner” with 0.9999 confidence — but Grad-CAM still works fine on it, localizing the true class correctly.