Gradient-weighted Class Activation Mapping (Grad-CAM) for a particular category highlights the discriminative image regions that the CNN uses to identify that category.
The goals of this blog post are to:
- understand the concept of Grad-CAM
- understand that Grad-CAM is a generalization of CAM
- understand how to use it with keras-vis
- implement it using Keras's backend functions.
Reference¶
- Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
- Deep Learning: Class Activation Maps Theory
- keras-vis
Reference in this blog¶
- Visualization of deep learning classification model using keras-vis
- Saliency Map with keras-vis
- Grad-CAM with keras-vis
To set up the same conda environment as mine, follow:
Visualization of deep learning classification model using keras-vis
Setup¶
import keras
import tensorflow as tf
import vis ## keras-vis
import matplotlib.pyplot as plt
import numpy as np
print("keras {}".format(keras.__version__))
print("tensorflow {}".format(tf.__version__))
Read in pre-trained model¶
For this exercise, I will use VGG16.
from keras.applications.vgg16 import VGG16, preprocess_input
model = VGG16(weights='imagenet')
model.summary()
for ilayer, layer in enumerate(model.layers):
    print("{:3.0f} {:10}".format(ilayer, layer.name))
Download a json file containing ImageNet class names.
wget "https://raw.githubusercontent.com/raghakot/keras-vis/master/resources/imagenet_class_index.json"
Read in the class index json file
import json
CLASS_INDEX = json.load(open("imagenet_class_index.json"))
classlabel = []
for i_dict in range(len(CLASS_INDEX)):
    classlabel.append(CLASS_INDEX[str(i_dict)][1])
print("N of class={}".format(len(classlabel)))
Let's read in an image that contains both a dog and a cat. Such an image is likely to confuse a model trained on ImageNet, where most images contain a single object.
The goal of this exercise is to understand why the VGG16 model makes its classification decision.
from keras.preprocessing.image import load_img, img_to_array
#_img = load_img("duck.jpg",target_size=(224,224))
_img = load_img("dog_and_cat.jpg",target_size=(224,224))
plt.imshow(_img)
plt.show()
Let's predict the object class of this image, and show the top 5 predicted classes.
Unfortunately, the second-ranked predicted class does not make any sense. Why towel?? However, the other predicted classes are all dog breeds, which makes some sense:
Top 1 predicted class: Pr(Class=redbone [index=168])=0.360
Top 3 predicted class: Pr(Class=bloodhound [index=163])=0.076
Top 4 predicted class: Pr(Class=basenji [index=253])=0.042
Top 5 predicted class: Pr(Class=golden_retriever [index=207])=0.041
img = img_to_array(_img)
img = preprocess_input(img)
y_pred = model.predict(img[np.newaxis,...])
class_idxs_sorted = np.argsort(y_pred.flatten())[::-1]
topNclass = 5
for i, idx in enumerate(class_idxs_sorted[:topNclass]):
    print("Top {} predicted class: Pr(Class={:18} [index={}])={:5.3f}".format(
        i + 1,classlabel[idx],idx,y_pred[0,idx]))
Grad-CAM¶
Last Convolutional Layer of the CNN¶
Let $A^k \in \mathbb{R}^{u\textrm{ x } v}$ be the $k$th feature map ($k=1,\cdots,K$) from the last convolutional layer. This feature map has height $u$ and width $v$. For example, in the case of VGG16, $u=14, v=14, K=512$. Grad-CAM uses these $A^k$ to visualize the decision made by the CNN.
Some observations¶
- Convolutional feature maps retain spatial information (which is lost in fully-connected layers).
- Each kernel represents some visual pattern. For example, one kernel may capture a dog, another may capture a bird, etc.
- Each pixel of a feature map indicates whether the corresponding kernel's visual pattern exists in its receptive field.
The last convolutional layer can be thought of as the features of a classification model.
$$ y^c = f(A^1,...,A^{512})$$
Idea¶
- Visualization of the final feature map ($A^k$) will show the discriminative region of the image.
- The simplest summary of all the $A^k,k=1,...,K$ is a linear combination of them with some weights.
Some feature maps are more important than others for deciding on a given class, so the weights should depend on the class of interest.
$$L^c_{Grad-CAM} \approx \sum_{k=1}^K \alpha_k^c A^k \in \mathbb{R}^{u\textrm{ x } v}$$
So the question is: what should the weights $\alpha_k^c$ be?
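As a rough illustration of the weighted sum above, here is a minimal NumPy sketch. The arrays `feature_maps` and `weights` are random stand-ins that I made up for illustration; they are not taken from VGG16, and only the shapes match the $u=14, v=14, K=512$ example.

```python
import numpy as np

# Hypothetical stand-ins (random numbers, not real VGG16 activations):
# K feature maps of size u x v, and one weight per feature map.
rng = np.random.default_rng(0)
K, u, v = 512, 14, 14
feature_maps = rng.random((K, u, v))   # plays the role of A^k, k = 1..K
weights = rng.normal(size=K)           # plays the role of alpha_k^c for one class c

# Weighted sum over the K maps: sum_k alpha_k^c * A^k
heatmap = np.tensordot(weights, feature_maps, axes=1)
print(heatmap.shape)  # (14, 14)
```

Whatever the weights are, the result keeps the spatial layout of the feature maps, which is why it can later be overlaid on the input image.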
Calculating $\alpha_{k}^c$¶
The gradient of the $c$th class score with respect to the feature map $A^{k}$,
$$\frac{d y^c}{d A^k_{i,j}}$$
measures the linear effect of the $(i,j)$th pixel in the $k$th feature map on the $c$th class score. So average pooling of this gradient across $i$ and $j$ summarizes the effect of feature map $k$ on the $c$th class score.
Grad-CAM proposes to use this averaged gradient as the weight for each feature map: $$ \alpha_{k}^c = \frac{1}{uv} \sum_{i=1}^u \sum_{j=1}^v \frac{d y^c}{d A^k_{i,j}} $$
Finally, $L^c_{Grad-CAM}$ is defined as: $$ L^c_{Grad-CAM} = ReLU\left( \sum_{k=1}^K \alpha_k^c A^k \right)\in \mathbb{R}^{u\textrm{ x } v} $$
ReLU is applied to the linear combination of maps
because we are only interested in the features that have a positive influence on the class of interest, i.e., pixels whose intensity should be increased in order to increase $y^c$.
Finally, we upsample the class activation map to the size of the input image to identify the image regions most relevant to the particular category.
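The three steps above (average the gradients, take the ReLU of the weighted sum, upsample) can be sketched on toy arrays as follows. This is only an illustration under my own assumptions: `fmaps` and `grads` are random placeholders rather than real network outputs, the helper name `grad_cam_from_arrays` is hypothetical, and the upsampling uses simple nearest-neighbour repetition via `np.kron` instead of a smoother interpolation.

```python
import numpy as np

def grad_cam_from_arrays(fmaps, grads, upsample=16):
    """Toy Grad-CAM from precomputed arrays (hypothetical helper).

    fmaps, grads: arrays of shape (u, v, K) holding A^k and d y^c / d A^k.
    """
    # alpha_k^c: average of the gradient over the spatial positions (i, j)
    alpha = grads.mean(axis=(0, 1))                      # shape (K,)
    # ReLU of the weighted sum of the feature maps
    cam = np.maximum((fmaps * alpha).sum(axis=-1), 0.0)  # shape (u, v)
    # Nearest-neighbour upsampling: repeat every pixel upsample x upsample times
    return np.kron(cam, np.ones((upsample, upsample)))

rng = np.random.default_rng(0)
fmaps = rng.random((14, 14, 512))        # placeholder for the feature maps
grads = rng.normal(size=(14, 14, 512))   # placeholder for the gradients
cam = grad_cam_from_arrays(fmaps, grads)
print(cam.shape)  # (224, 224)
```

With a factor of 16, a 14 x 14 map becomes 224 x 224, which matches the VGG16 input size used in this post.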
Note: As in Saliency Map, the softmax activation of the final layer is replaced with linear. See discussion in previous blog Saliency Map with keras-vis.
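One way to see why the swap matters is to compare the two gradients on a toy example. This is my own illustrative sketch with made-up logits, not the argument from the previous post: the gradient of a softmax probability with respect to the logits has negative entries for the other classes, so the probability can be raised by suppressing other classes, whereas the gradient of the linear score only involves the class of interest.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])   # made-up scores for 3 classes
c = 0                                # class of interest
p = softmax(logits)
one_hot = np.eye(len(logits))[c]

# Softmax output: d p_c / d z_j = p_c * (1[j == c] - p_j)
grad_softmax = p[c] * (one_hot - p)
# Linear output: d z_c / d z_j = 1[j == c]
grad_linear = one_hot

print(grad_softmax)  # positive for class c, negative for the other classes
print(grad_linear)
```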
Understand Grad-CAM in special case: Network with Global Average Pooling¶
GoogLeNet and MobileNet belong to this group of networks. Such a network consists largely of convolutional layers; just before the final output layer, global average pooling is applied to the convolutional feature maps, and the pooled values are used as features for a fully-connected layer that produces the desired output (categorical or otherwise).
Given the simple connectivity structure of this network, Grad-CAM has an easy interpretation.
The weights $\alpha_k^c$ are simply the weights of the final fully connected layer, up to the constant factor $1/(uv)$ introduced by the averaging.
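The argument can be sketched in a few lines. The symbols $F^k$ (the globally averaged feature map), $w_k^c$ (the weight connecting $F^k$ to class $c$) and $b^c$ (the bias) are notation I introduce here for illustration; they do not appear elsewhere in this post.

$$
\begin{aligned}
F^k &= \frac{1}{uv} \sum_{i=1}^u \sum_{j=1}^v A^k_{i,j}, \qquad y^c = \sum_{k=1}^K w_k^c F^k + b^c \\
\frac{d y^c}{d A^k_{i,j}} &= w_k^c \frac{d F^k}{d A^k_{i,j}} = \frac{w_k^c}{uv} \\
\alpha_k^c &= \frac{1}{uv} \sum_{i=1}^u \sum_{j=1}^v \frac{w_k^c}{uv} = \frac{w_k^c}{uv}
\end{aligned}
$$

So in this special case $\alpha_k^c$ is proportional to $w_k^c$, and $\sum_{k=1}^K \alpha_k^c A^k$ equals the CAM heatmap $\sum_{k=1}^K w_k^c A^k$ scaled by $1/(uv)$.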
Now we understand the concept of Grad-CAM. Let's use it through the keras-vis API. First, we replace the final softmax activation with a linear one.
from vis.utils import utils
# Utility to search for layer index by name.
# Alternatively we can specify this as -1 since it corresponds to the last layer.
layer_idx = utils.find_layer_idx(model, 'predictions')
# Swap softmax with linear
model.layers[layer_idx].activation = keras.activations.linear
model = utils.apply_modifications(model)
Calculate Grad-CAM
from vis.visualization import visualize_cam
penultimate_layer_idx = utils.find_layer_idx(model, "block5_conv3")
class_idx = class_idxs_sorted[0]
seed_input = img
grad_top1 = visualize_cam(model, layer_idx, class_idx, seed_input,
                          penultimate_layer_idx = penultimate_layer_idx,#None,
                          backprop_modifier = None,
                          grad_modifier = None)
Visualization
def plot_map(grads):
    # Note: class_idx, _img and y_pred are read from the global scope.
    fig, axes = plt.subplots(1,2,figsize=(14,5))
    axes[0].imshow(_img)
    axes[1].imshow(_img)
    i = axes[1].imshow(grads,cmap="jet",alpha=0.8)
    fig.colorbar(i)
    plt.suptitle("Pr(class={}) = {:5.2f}".format(
        classlabel[class_idx],
        y_pred[0,class_idx]))
plot_map(grad_top1)
Observations¶
- The model actually focuses mostly on the dog rather than the cat, probably because more than a hundred of the 1,000 ImageNet classes are dog breeds.
- The Grad-CAM intensity for the towel class is spread across the whole image.
for class_idx in class_idxs_sorted[:topNclass]:
    grads = visualize_cam(model,layer_idx,class_idx, seed_input,
                          penultimate_layer_idx = penultimate_layer_idx,
                          backprop_modifier = None,
                          grad_modifier = None)
    plot_map(grads)
Grad-CAM by hand¶
We know how Grad-CAM works, so let's implement it by hand.
import keras.backend as K
from scipy.ndimage import zoom  # scipy.ndimage.interpolation is a deprecated alias
## select class of interest
class_idx = class_idxs_sorted[0]
## feature map from the final convolutional layer
final_fmap_index = utils.find_layer_idx(model, 'block5_conv3')
penultimate_output = model.layers[final_fmap_index].output
## define derivative d loss^c / d A^k,k =1,...,512
layer_input = model.input
## This model must already use linear activation for the final layer
loss = model.layers[layer_idx].output[...,class_idx]
grad_wrt_fmap = K.gradients(loss,penultimate_output)[0]
## create a function that evaluates the gradient for a given input
# This function accepts numpy arrays
grad_wrt_fmap_fn = K.function([layer_input,K.learning_phase()],
                              [penultimate_output,grad_wrt_fmap])
## evaluate the derivative_fn
fmap_eval, grad_wrt_fmap_eval = grad_wrt_fmap_fn([img[np.newaxis,...],0])
# For numerical stability. Very small grad values along with small penultimate_output_value can cause
# w * penultimate_output_value to zero out, even for reasonable fp precision of float32.
grad_wrt_fmap_eval /= (np.max(grad_wrt_fmap_eval) + K.epsilon())
print(grad_wrt_fmap_eval.shape)
alpha_k_c = grad_wrt_fmap_eval.mean(axis=(0,1,2)).reshape((1,1,1,-1))
Lc_Grad_CAM = np.maximum(np.sum(fmap_eval*alpha_k_c,axis=-1),0).squeeze()
## upsample the class activation map to the size of the input image
scale_factor = np.array(img.shape[:-1])/np.array(Lc_Grad_CAM.shape)
_grad_CAM = zoom(Lc_Grad_CAM,scale_factor)
## normalize to range between 0 and 1
arr_min, arr_max = np.min(_grad_CAM), np.max(_grad_CAM)
grad_CAM = (_grad_CAM - arr_min) / (arr_max - arr_min + K.epsilon())
Visualize 14 x 14 $L^c_{GRAD-CAM}$¶
plt.imshow(Lc_Grad_CAM)
plt.show()
Visualize the weights $\alpha_k^c$¶
Visualizing the weights shows which feature map is most important for this class.
plt.figure(figsize=(20,5))
plt.plot(alpha_k_c.flatten())
plt.xlabel("Feature Map at Final Convolutional Layer")
plt.ylabel("alpha_k^c")
plt.title("The {}th feature map has the largest weight alpha_k^c".format(
    np.argmax(alpha_k_c.flatten())))
plt.show()
Using visualize_activation() from keras-vis, we can visualize the image that maximizes this most influential feature map. Does it make any sense??
from vis.visualization import visualize_activation
activation_max = visualize_activation(model,
                                      layer_idx = final_fmap_index,
                                      max_iter = 100,
                                      verbose = True,
                                      filter_indices = 155)
print(activation_max.shape)
plt.imshow(activation_max)
plt.show()
Make sure that the hand-calculated Grad-CAM is the same as the output of keras-vis.
assert np.all(np.abs(grad_CAM - grad_top1) < 0.0001)
plot_map(grad_CAM)