
How does batch normalization work for convolutional neural networks

Data Science Asked on April 5, 2021

I am trying to understand how batch normalization (BN) works in CNNs. Suppose I have a feature map tensor $T$ of shape $(N, C, H, W)$

where $N$ is the mini-batch size,

$C$ is the number of channels, and

$H, W$ are the spatial dimensions of the tensor.

Then it seems there could be a few ways of going about this (where $\gamma, \beta$ are learned parameters for each BN layer):

Method 1: $T_{n,c,x,y} := \gamma \cdot \frac{T_{n,c,x,y} - \mu_{x,y}}{\sqrt{\sigma^2_{x,y} + \epsilon}} + \beta$, where $\mu_{x,y} = \frac{1}{NC}\sum_{n,c} T_{n,c,x,y}$ is the mean over the mini-batch and over all channels $c$ at spatial location $x,y$, and

$\sigma^2_{x,y} = \frac{1}{NC} \sum_{n,c} (T_{n,c,x,y} - \mu_{x,y})^2$ is the corresponding variance over the mini-batch and all channels $c$ at spatial location $x,y$.

Method 2: $T_{n,c,x,y} := \gamma \cdot \frac{T_{n,c,x,y} - \mu_{c,x,y}}{\sqrt{\sigma^2_{c,x,y} + \epsilon}} + \beta$, where $\mu_{c,x,y} = \frac{1}{N}\sum_{n} T_{n,c,x,y}$ is the mean over the mini-batch for a specific channel $c$ at spatial location $x,y$, and

$\sigma^2_{c,x,y} = \frac{1}{N} \sum_{n} (T_{n,c,x,y} - \mu_{c,x,y})^2$ is the corresponding variance over the mini-batch for channel $c$ at spatial location $x,y$.

Method 3: For each channel $c$ we compute the mean/variance over the mini-batch and over all spatial locations $x,y$, and apply the formula as

$T_{n,c,x,y} := \gamma \cdot \frac{T_{n,c,x,y} - \mu_{c}}{\sqrt{\sigma^2_{c} + \epsilon}} + \beta$, where now $\mu_c = \frac{1}{NHW} \sum_{n,x,y} T_{n,c,x,y}$ and $\sigma^2_c = \frac{1}{NHW} \sum_{n,x,y} (T_{n,c,x,y} - \mu_c)^2$.
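
To make the difference concrete, here is a minimal sketch (assuming PyTorch; tensor sizes are arbitrary and not part of the question) of the statistics each of the three methods would compute:

```python
import torch

# Illustrative feature map tensor of shape (N, C, H, W); sizes are arbitrary.
N, C, H, W = 8, 16, 32, 32
T = torch.randn(N, C, H, W)
eps = 1e-5

# Method 1: one mean/variance per spatial location (x, y), pooled over batch and channels.
mu1 = T.mean(dim=(0, 1), keepdim=True)                        # shape (1, 1, H, W)
var1 = T.var(dim=(0, 1), keepdim=True, unbiased=False)

# Method 2: one mean/variance per (channel, x, y), pooled over the batch only.
mu2 = T.mean(dim=0, keepdim=True)                             # shape (1, C, H, W)
var2 = T.var(dim=0, keepdim=True, unbiased=False)

# Method 3: one mean/variance per channel, pooled over batch and all spatial locations.
mu3 = T.mean(dim=(0, 2, 3), keepdim=True)                     # shape (1, C, 1, 1)
var3 = T.var(dim=(0, 2, 3), keepdim=True, unbiased=False)

# Each method then normalizes as (T - mu) / sqrt(var + eps), e.g. for Method 3:
T_hat3 = (T - mu3) / torch.sqrt(var3 + eps)
```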

In practice, which of these methods (if any) is the one actually used?

The original paper on batch normalization, https://arxiv.org/pdf/1502.03167.pdf, states on page 5, section 3.2, last paragraph, left side of the page:

For convolutional layers, we additionally want the normalization to obey the convolutional property – so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a minibatch, over all locations. In Alg. 1, we let $\mathcal{B}$ be the set of all values in a feature map across both the elements of a mini-batch and spatial locations – so for a mini-batch of size $m$ and feature maps of size $p \times q$, we use the effective mini-batch of size $m^\prime = \vert \mathcal{B} \vert = m \cdot pq$. We learn a pair of parameters $\gamma^{(k)}$ and $\beta^{(k)}$ per feature map, rather than per activation. Alg. 2 is modified similarly, so that during inference the BN transform applies the same linear transformation to each activation in a given feature map.
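
Read literally, this pools the statistics for each feature map over both the mini-batch and the spatial dimensions, i.e. over $m \cdot pq$ values per channel. A minimal sketch of that training-time transform, assuming PyTorch (the helper name and shapes are mine, not from the paper):

```python
import torch

def conv_batchnorm_train(T, gamma, beta, eps=1e-5):
    # T: (N, C, H, W); gamma, beta: (C,) -- one pair of learned parameters per feature map.
    # Statistics are pooled over the mini-batch and both spatial dimensions,
    # i.e. over m * p * q values per channel.
    mu = T.mean(dim=(0, 2, 3), keepdim=True)                     # (1, C, 1, 1)
    var = T.var(dim=(0, 2, 3), keepdim=True, unbiased=False)     # (1, C, 1, 1)
    T_hat = (T - mu) / torch.sqrt(var + eps)
    return gamma.view(1, -1, 1, 1) * T_hat + beta.view(1, -1, 1, 1)

# Example call with identity scale and zero shift:
x = torch.randn(4, 3, 5, 5)
y = conv_batchnorm_train(x, gamma=torch.ones(3), beta=torch.zeros(3))
```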

I’m not sure what the authors mean by "per feature map"; does this mean per channel?

One Answer

Method 3:
This is the original Batch Normalization as suggested in the paper [Ioffe & Szegedy, 2015] for convolutional layers: one mean and variance per channel, computed over the mini-batch and all spatial locations. It is the most common approach, and it is very well explained here [d2l.ai]:

Similarly, with convolutional layers, we can apply batch normalization after the convolution and before the nonlinear activation function. When the convolution has multiple output channels, we need to carry out batch normalization for each of the outputs of these channels, and each channel has its own scale and shift parameters, both of which are scalars. Assume that our minibatches contain m examples and that for each channel, the output of the convolution has height p and width q. For convolutional layers, we carry out each batch normalization over the m⋅p⋅q elements per output channel simultaneously. Thus, we collect the values over all spatial locations when computing the mean and variance and consequently apply the same mean and variance within a given channel to normalize the value at each spatial location.

I'm not sure what the authors mean by "per feature map"; does this mean per channel?

Yes, there are two trainable parameters (a scale and a shift) per channel/feature map.
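
For instance, with PyTorch's built-in layer (an assumption for illustration, not part of the original answer), the learned scale and shift are one scalar per channel, and in training mode the normalization uses per-channel statistics pooled over the batch and spatial dimensions:

```python
import torch

bn = torch.nn.BatchNorm2d(num_features=16)     # one gamma and one beta per channel
print(bn.weight.shape, bn.bias.shape)          # torch.Size([16]) torch.Size([16])

# In training mode the layer normalizes with per-channel batch statistics
# pooled over (N, H, W), i.e. the m*p*q pooling from the quote above.
x = torch.randn(8, 16, 32, 32)
bn.train()
y = bn(x)
manual = (x - x.mean(dim=(0, 2, 3), keepdim=True)) / torch.sqrt(
    x.var(dim=(0, 2, 3), keepdim=True, unbiased=False) + bn.eps)
print(torch.allclose(y, manual, atol=1e-5))    # expected True (gamma=1, beta=0 at initialization)
```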

Method 2:
This normalizes each activation separately: one mean and variance per (channel, spatial location) pair, computed over the mini-batch only. That is how the paper defines BN for fully connected layers, but for convolutional layers it does not obey the "convolutional property" quoted in the question, since different locations of the same feature map would be normalized differently.

For completeness: normalizing within each example instead of over the mini-batch is the idea suggested as "Layer Normalization" [Paper]. It fixed BN's dependence on a batch of data and also worked for sequence data, but the paper didn't claim anything great for CNNs:

We have also experimented with convolutional neural networks. In our preliminary experiments, we observed that layer normalization offers a speedup over the baseline model without normalization, but batch normalization outperforms the other methods. With fully connected layers, all the hidden units in a layer tend to make similar contributions to the final prediction and re-centering and rescaling the summed inputs to a layer works well. However, the assumption of similar contributions is no longer true for convolutional neural networks. The large number of the hidden units whose receptive fields lie near the boundary of the image are rarely turned on and thus have very different statistics from the rest of the hidden units within the same layer. We think further research is needed to make layer normalization work well in ConvNets
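
For contrast, a minimal sketch (again assuming PyTorch; illustrative only) of layer-normalization statistics on a convolutional feature map, computed per sample rather than over the batch:

```python
import torch

x = torch.randn(8, 16, 32, 32)   # (N, C, H, W)
eps = 1e-5

# Layer norm: one mean/variance per sample n, pooled over (C, H, W) -- no batch dependence.
mu = x.mean(dim=(1, 2, 3), keepdim=True)                      # (N, 1, 1, 1)
var = x.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
x_hat = (x - mu) / torch.sqrt(var + eps)

# torch.nn.GroupNorm with a single group computes the same statistics.
gn = torch.nn.GroupNorm(num_groups=1, num_channels=16, eps=eps)
print(torch.allclose(gn(x), x_hat, atol=1e-5))                # expected True (affine params start at 1 and 0)
```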

Method 1:
This averages across the feature maps (channels) at every pixel. I am not sure of its application.

Answered by 10xAI on April 5, 2021
