
How do Ioffe & Szegedy obtain the equation $\frac{\partial\,\text{BN}((aW)u)}{\partial u} = \frac{\partial\,\text{BN}(Wu)}{\partial u}$?

Asked on Data Science, April 27, 2021

In the paper that introduced Batch Normalization, on page 5, the authors write the equation

$$\frac{\partial\,\text{BN}((aW)u)}{\partial u} = \frac{\partial\,\text{BN}(Wu)}{\partial u}$$

Here $W$ is the matrix of weights connecting the layer $u$ to the next, batch-normalized layer, so the conclusion is that scaling the weights by a constant $a$ doesn't affect this partial derivative.
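
To make the claim concrete, here is a minimal numerical sketch (not from the paper, just an illustration in NumPy): it takes BN to be the usual transform that subtracts the batch mean and divides by the batch standard deviation (the learnable scale and shift are omitted), and compares finite-difference derivatives with respect to an input coordinate for $W$ and $aW$.

```python
# Minimal sketch (assumption: BN = subtract batch mean, divide by batch std,
# no learnable gamma/beta). Checks numerically that the derivative of
# BN(Wu) w.r.t. an input coordinate is unchanged when W is replaced by aW.
import numpy as np

def bn(z, eps=1e-8):
    # Normalize each output unit over the batch dimension (axis 0).
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

def grad_wrt_u(W, U, i=0, j=0, k=0, h=1e-5):
    # Finite-difference derivative of BN(U @ W.T)[k, i] w.r.t. U[k, j].
    U_plus, U_minus = U.copy(), U.copy()
    U_plus[k, j] += h
    U_minus[k, j] -= h
    return (bn(U_plus @ W.T)[k, i] - bn(U_minus @ W.T)[k, i]) / (2 * h)

rng = np.random.default_rng(0)
N, d_in, d_out = 32, 5, 3          # batch size, input dim, output dim
U = rng.normal(size=(N, d_in))     # a batch of inputs u
W = rng.normal(size=(d_out, d_in))
a = 7.0                            # the scalar from the paper's equation

g1 = grad_wrt_u(W, U)
g2 = grad_wrt_u(a * W, U)
print(g1, g2)   # the two derivatives agree up to finite-difference error
```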

This seems false to me. Let $b$ be the value of some output neuron and $u$ the layer above, so that:

$$b=\sum_i w_iu_i$$

Now let $\hat b$ be the batch-normalized version of $b$:

$$\hat b = b - \frac1N\sum_i b^i$$

where by $b^i$ I mean the value of the neuron $b$ for the $i$-th training input in the batch. We have

$$\hat b = \sum_i w_iu_i - \frac1N\sum_j b^j = \sum_i w_iu_i - \frac1N\sum_j \Big(\sum_i w_iu_i^j\Big)$$

Since the values $u_i$ never appear in the second sum, we simply have

$$\partial_{u_i} \hat b = w_i,$$

which very much does scale with $W$. Am I making a mistake, or misinterpreting the original equation?
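
For comparison, here is a small check of the mean-only centering used in the derivation above; the division by the batch standard deviation is deliberately left out so that the transform matches the definition of $\hat b$ given here. Under that definition, the finite-difference derivative with respect to $u_i$ does scale with $W$.

```python
# Sketch of the mean-only centering from the derivation above (an assumption
# of this sketch, not the full batch-norm transform): hat_b = b - (1/N) sum_j b^j.
# With this definition the derivative w.r.t. u_i scales with W.
import numpy as np

def center(z):
    # Subtract the batch mean only; no division by the batch std.
    return z - z.mean(axis=0)

def grad_wrt_u(W, U, i=0, j=0, k=0, h=1e-5):
    U_plus, U_minus = U.copy(), U.copy()
    U_plus[k, j] += h
    U_minus[k, j] -= h
    return (center(U_plus @ W.T)[k, i] - center(U_minus @ W.T)[k, i]) / (2 * h)

rng = np.random.default_rng(0)
N, d_in, d_out = 32, 5, 3
U = rng.normal(size=(N, d_in))
W = rng.normal(size=(d_out, d_in))
a = 7.0

g1 = grad_wrt_u(W, U)
g2 = grad_wrt_u(a * W, U)
print(g2 / g1)   # ~ a: with mean-only centering the derivative scales with W
```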
