
Why is it important to have stable activations in the forward pass?

Data Science Asked on February 8, 2021

In much of the literature on weight initialization, I find the claim that it is important to keep the activations stable through the layers, that is, to make sure they stay at roughly the same size/order of magnitude as you go deeper into the network. Sometimes it's implied that this has something to do with avoiding exploding or vanishing gradients.

For example, in this blog article:

During the forward step, the activations (and then the gradients) can quickly get really big or really small — this is due to the fact that we repeat a lot of matrix multiplications. Either of these effects is fatal for training.
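
To see what the article means by the forward step, here is a minimal NumPy sketch of my own (the width, depth, and initialization scales are arbitrary choices, and I use purely linear layers so that the effect of the repeated matrix multiplications is not masked by a nonlinearity):

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 256, 50
x = rng.standard_normal((1, width))

# Purely linear stack: h <- h @ W at every layer, so the scale of W
# compounds multiplicatively with depth.
for scale in (0.5, 1.0, 2.0):
    h = x.copy()
    for _ in range(depth):
        W = scale * rng.standard_normal((width, width)) / np.sqrt(width)
        h = h @ W
    print(f"init scale {scale}: activation std after {depth} layers = {h.std():.3e}")
```

Since each entry of $W$ has standard deviation $\text{scale}/\sqrt{\text{width}}$, every layer multiplies the activation norm by roughly the scale factor, so after 50 layers only a scale of 1 keeps the activations stable.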

But why? To me this seems like a non sequitur: I don't see why unstable activations would imply unstable gradients. The only thing I can think of is that if we have an edge with weight $w$ connecting a node with activation $f(x)$ to a node with pre-activation $y$, we have the following formula:

$$
\frac{\partial C}{\partial w} = f(x)\,\frac{\partial C}{\partial y}
$$

Is it just because the activation $f(x)$ appears in that formula? I'm not sure, because no source I've found says this explicitly. And what if the derivative $\frac{\partial C}{\partial y}$ cancels out the effect of $f(x)$, for instance if the activations get really big as we go through the layers while the derivatives get correspondingly small?
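
To make my doubt about cancellation concrete, here is a small manual-backprop sketch I tried (again my own toy setup: a purely linear network with loss $C = \frac{1}{2}\lVert h_L \rVert^2$, chosen only because the backward pass is easy to write by hand). It prints the size of the weight gradients for the first and last layer under different initialization scales:

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 64, 30
x = rng.standard_normal((1, width))

for scale in (0.5, 1.0, 2.0):
    # Deep linear net: h_l = h_{l-1} @ W_l, with loss C = 0.5 * ||h_L||^2.
    Ws = [scale * rng.standard_normal((width, width)) / np.sqrt(width)
          for _ in range(depth)]

    # Forward pass, keeping every activation (the "f(x)" feeding each layer).
    hs = [x]
    for W in Ws:
        hs.append(hs[-1] @ W)

    # Backward pass: g is dC/dh_{l+1} when we take the gradient of W_l.
    g = hs[-1]                                        # dC/dh_L for C = 0.5 * ||h_L||^2
    grad_norms = [0.0] * depth
    for l in reversed(range(depth)):
        grad_norms[l] = np.linalg.norm(hs[l].T @ g)   # dC/dW_l = f(x)^T * dC/dy
        g = g @ Ws[l].T                               # propagate to dC/dh_l

    print(f"scale {scale}: ||dC/dW_first|| = {grad_norms[0]:.2e}, "
          f"||dC/dW_last|| = {grad_norms[-1]:.2e}")
```

At least in this linear toy case the two factors grow or shrink together rather than cancelling, but I don't know whether that is the reasoning the sources have in mind.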
