
In GradCAM, why is activation strength considered an indicator of relevant regions?

Artificial Intelligence · Asked on January 17, 2021

In Section 3 of the Grad-CAM paper, the authors implicitly propose that two things are needed to understand which areas of an input image contribute most to the output class (in a multi-label classification problem). That is:

  • $A^k$, the final feature maps
  • $\alpha_k^c$, the average-pooled partial derivatives of the output class score $y^c$ with respect to the final feature maps $A^k$

The second point is clear to me. The stronger the derivative, the more important the $k$th channel of the final feature maps is.

The first point is not, because the implicit assumption is that non-zero activations have more significance than activations close to zero. I know it’s tempting to take that as a given, but for me it’s not so obvious. After all, neurons have biases, and a bias can arbitrarily shift the reference point, and hence what 0 means. We can easily transform two neurons [0, 1] to [1, 0] with a linear transformation.
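For concreteness, the [0, 1] to [1, 0] remark is just an affine map; a toy NumPy illustration (not from the paper):

```python
import numpy as np

x = np.array([0.0, 1.0])   # activations of two neurons
W = -np.eye(2)             # weights of a 2x2 linear layer
b = np.array([1.0, 1.0])   # biases shift the reference point, i.e. what "0" means

print(W @ x + b)           # [1. 0.] -- zero and non-zero have swapped roles
```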

So why should it matter which regions of the final feature maps are strongly activated?


EDIT

To address a comment further down, this table explains why I’m thinking about magnitude rather than sign of the activations.

[table image]

It comes from thinking about the possible variations of

$$
L_{\text{Grad-CAM}}^c = \mathrm{ReLU}\Bigl( \sum_k \alpha_k^c A^k \Bigr)
$$
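For concreteness, these quantities can be computed with a few lines of PyTorch. This is only a rough sketch, where `model.features` and `model.classifier` are placeholder names for the final convolutional block and the classification head of whatever trained network is being explained:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, class_idx):
    """Minimal Grad-CAM sketch: image is (1, C, H, W), class_idx picks y^c."""
    model.eval()
    A = model.features(image)                        # A^k: final feature maps, (1, K, u, v)
    scores = model.classifier(A)                     # class scores, (1, num_classes)
    grads, = torch.autograd.grad(scores[0, class_idx], A)   # dy^c / dA^k
    alpha = grads.mean(dim=(2, 3))                   # alpha_k^c: average-pooled gradients, (1, K)
    cam = (alpha[:, :, None, None] * A).sum(dim=1)   # sum_k alpha_k^c A^k, (1, u, v)
    return F.relu(cam).detach()                      # keep only positive contributions
```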

One Answer

I think you are misreading the relevant passage here.

Since you do not specify the exact excerpt(s), I take it that by "implicit assumption" you refer to equation (2) (the application of a ReLU) and the corresponding text explanation (bold emphasis mine):


We apply a ReLU to the linear combination of maps because we are only interested in the features that have a positive influence on the class of interest, i.e. pixels whose intensity should be increased in order to increase $y^c$. Negative pixels are likely to belong to other categories in the image. As expected, without this ReLU, localization maps sometimes highlight more than just the desired class and perform worse at localization.

The first thing to notice here is that this choice is not at all about activations close to zero, as you seem to believe, but about negative ones; and since negative activations are indeed likely to belong to other categories/classes than the one being "explained" at a given trial, it is very natural to exclude them using a ReLU.

Grad-CAM maps are essentially localization ones; this is apparent already from the paper abstract (emphasis mine):

Our approach – Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept (say ‘dog’ in a classification network or a sequence of words in captioning network) flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept.

and they are even occasionally referred to as "Grad-CAM localizations" (e.g. in the caption of Fig. 14); taking a standard example figure from the paper, e.g. this part of Fig. 1:

[Fig. 1 excerpt: Grad-CAM maps for 'cat' and 'dog' in the same image]

it is hard to see how including the negative values of the map (i.e. removing the thresholding imposed by the ReLU) would not lead to maps that include irrelevant parts of the image, hence resulting in a worse localization.


A general remark: while your claim that

After all, neurons have biases, and a bias can arbitrarily shift the reference point, and hence what 0 means

is correct as long as we treat the network as an arbitrary mathematical model, we can no longer treat a trained network as such. For a trained network (which Grad-CAM is all about), the exact values of both biases & weights matter, and we cannot transform them arbitrarily.


UPDATE (after comments):

Are you pointing out that it's called "localization" and therefore must be so?

It is called "localization" because it is localization, literally ("look, here is the 'dog' in the picture, not there").

I could make a similar challenge "why does positive mean X and why does negative mean Y, and why will this always be true in any trained network?"

It is not at all like that; positive means X and negative means not-X in the presence of class X (i.e. a specific X is present), in the specific network, and all this in a localization context; notice that Grad-CAM for "dog" is different from the one for "cat" in the picture above.

Why is it that a trained network tends to make 0 mean "insignificant" in the deeper layers? [...] why is it that a network should invariably make the positive activations be the ones that support the prediction rather than the negative ones? Couldn't a network just learn it the other way around but use a negative sign on the weights of the final dense layer (so negative flips to positive and thus supports the highest scoring class)?

Again, beware of such symmetry/invariance arguments when the symmetries/invariances in question are broken; and they are indeed broken here, for a very simple reason (albeit one hidden in the context): the specific one-hot encoding of the labels. We have encoded "cat" and "dog" as (say) [0, 1] and [1, 0] respectively, so, since we are interested in these 1s (which indicate class presence), it makes sense to look for the positive activations of the (late) convolutional layers. This breaks the positive/negative symmetry. Had we chosen to encode them as [0, -1] and [-1, 0] instead ("minus-one-hot encoding"), then yes, your argument would hold, and we would be interested in the negative activations. But since we take the one-hot encoding as given, the problem is no longer symmetric/invariant around zero: by using this specific label encoding we have already chosen a side (and thus broken the symmetry)...
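A toy numerical check of this point (NumPy, with a global-average-pool plus linear head standing in for the real network; all shapes and names are made up for illustration): if the network stored the same evidence with flipped sign in its feature maps and compensated with negated classifier weights, the pooled gradients $\alpha_k^c$ would flip sign too, so what survives the ReLU is still the regions that push $y^c$ (the class encoded as 1) upwards.

```python
import numpy as np

rng = np.random.default_rng(0)
K, u, v = 4, 7, 7
A = rng.normal(size=(K, u, v))    # final feature maps A^k
w = rng.normal(size=K)            # classifier weights for class c (applied after global average pooling)

# y^c = sum_k w_k * mean(A^k), so dy^c/dA^k equals w_k/(u*v) at every location
# and the pooled weight is alpha_k^c = w_k / (u*v).
alpha = w / (u * v)
cam = np.maximum(0, np.tensordot(alpha, A, axes=1))          # ReLU(sum_k alpha_k^c A^k)

# "The other way around": flipped feature signs, compensated by negated weights.
alpha_flipped = -w / (u * v)                                 # the gradients flip sign as well
cam_flipped = np.maximum(0, np.tensordot(alpha_flipped, -A, axes=1))

print(np.allclose(cam, cam_flipped))   # True
```

The two maps coincide because the direction Grad-CAM keeps is "increases $y^c$", and which score we push up is fixed by the label encoding, not by the network's internal sign convention.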

Correct answer by desertnaut on January 17, 2021
