
Transfer Learning for CNNs and Batch Norm Layers

Data Science Asked by Jack Armstrong on September 4, 2021

In some transfer learning models, we set the training argument to False to maintain the pre-trained values of Batch Normalization, for example, and the trainable attribute to False to freeze the weights. Then the new "top layer" is added and we re-train the model. Afterwards, for fine-tuning, we can re-train the weights by setting the trainable attribute to True. However, what does the argument training=True do for a layer? The Stack Overflow answer here does not make sense to me. When the argument training is True, that, to me, implies we are doing some type of learning: i.e. the BN mean and variance are being updated, Dropout is being applied, and the weights are being updated in the backward pass. What is the difference between training=True and training=False? The Keras FAQ states that training=False just means inference is being performed, but then what was being trained when training=True?
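
To make the setup concrete, the pattern I am describing looks roughly like this (just a sketch; the MobileNet V2 base, input shape and single-unit head are illustrative choices, not the exact notebook code):

    import tensorflow as tf

    # Pre-trained base, frozen for feature extraction
    base_model = tf.keras.applications.MobileNetV2(
        input_shape=(160, 160, 3), include_top=False, weights="imagenet")
    base_model.trainable = False            # freeze weights: exclude them from backprop

    inputs = tf.keras.Input(shape=(160, 160, 3))
    x = base_model(inputs, training=False)  # run BN (and Dropout) in inference mode
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(1)(x)   # the new "top layer"
    model = tf.keras.Model(inputs, outputs)

    # Later, for fine-tuning: unfreeze the weights while training=False above stays in place
    base_model.trainable = True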

Lastly, this is being nit-picky, but in this notebook Google does transfer learning with the MobileNet V2 model, while in the guide from above they use the Xception model. Both models have BN, but in the second tutorial they pass the training=False argument to the base_model, implying "do not update BN", whereas in the first they make no mention of training=False. Why might that be? I see the first one is copyrighted in 2019 and the second one in 2020, which might explain the discrepancy.

One Answer

but what was being trained when training=True?

Let's try to understand the BatchNormalization (BN) layer first, as it has more elements.

TL;DR -
γ and β are learned. These are initialized just like normal weights and learned through backpropagation.
You may read this crisp and spot-on answer on these parameters on Stat.SE

Formally, BN transforms the activations at a given layer x according to the following expression:
BN(x) = γ ⊙ (x − μ) / σ + β
where γ and β are coordinate-wise scaling coefficients and offsets.
[Quoted - http://d2l.ai/]
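
As a quick numeric illustration of the expression above (the values of x, γ and β are arbitrary, and μ, σ are the batch statistics):

    import numpy as np

    x = np.array([1.0, 3.0, 5.0, 7.0])   # one activation across a batch of 4
    gamma, beta = 2.0, 0.5               # learned scale and offset (arbitrary values here)
    eps = 1e-3                           # small constant for numerical stability

    mu = x.mean()                        # batch mean = 4.0
    var = x.var()                        # batch variance = 5.0
    bn = gamma * (x - mu) / np.sqrt(var + eps) + beta
    print(bn)                            # ~[-2.18, -0.39, 1.39, 3.18]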

each BN layer adds four parameters per input: γ, β, μ, and σ (for example, the first BN layer adds 3,136 parameters, which is 4 × 784). The last two parameters, μ and σ, are the moving averages; they are not affected by backpropagation, so Keras calls them “non-trainable”. However, they are estimated during training, based on the training data, so arguably they are trainable. In Keras, “non-trainable” really means “untouched by backpropagation”.
[Quoted - Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow, Aurélien Géron]
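
A small sketch of where those counts come from, assuming a BN layer placed right after 784 flattened inputs (28 × 28 images):

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Flatten(),              # 28 * 28 = 784 features
        tf.keras.layers.BatchNormalization(),   # 4 * 784 = 3,136 parameters
    ])
    model.summary()
    # BatchNormalization: 3,136 params
    #   trainable:     1,568  (gamma, beta -> learned by backprop)
    #   non-trainable: 1,568  (moving mean, moving variance -> updated as moving averages)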

training=True: The layer will normalize its inputs using the mean and variance of the current batch of inputs.
training=False: The layer will normalize its inputs using the mean and variance of its moving statistics, learned during training. [Quoted - Keras doc for BN]

So, if you do not set it to False, the layer will continue updating μ and σ with every batch of test data and normalize the output accordingly. We want it to use the values estimated during the training phase instead.
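
A minimal sketch of this behaviour (the moving statistics start at 0/1 and only move when training=True, following the layer's default momentum of 0.99):

    import tensorflow as tf

    bn = tf.keras.layers.BatchNormalization()           # default momentum = 0.99
    x = tf.random.normal((32, 4), mean=5.0, stddev=2.0)

    _ = bn(x, training=False)
    print(bn.moving_mean.numpy())   # still all zeros: inference mode leaves the moving stats untouched

    _ = bn(x, training=True)
    print(bn.moving_mean.numpy())   # nudged toward the batch mean (~5), since training mode updates them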

By default, it is False, and the fit method sets it to True.

Dropout

Dropout is the simpler of the two. We need this flag so that, during testing, we can compensate (on an average basis) for the loss in the output value caused by the neurons switched off during training.

Suppose p = 50%, in which case during testing a neuron would be connected to twice as many input neurons as it would be (on average) during training. To compensate for this fact, we need to multiply each neuron’s input connection weights by 0.5 after training. If we don’t, each neuron will get a total input signal roughly twice as large as what the network was trained on and will be unlikely to perform well. More generally, we need to multiply each input connection weight by the keep probability (1 – p) after training. Alternatively, we can divide each neuron’s output by the keep probability during training (these alternatives are not perfectly equivalent, but they work equally well)
[Quoted - Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow, Aurélien Géron]
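
A quick sketch of the flag's effect on a Keras Dropout layer, which uses the second alternative from the quote (scaling the surviving activations by 1/(1 - p) during training):

    import tensorflow as tf

    drop = tf.keras.layers.Dropout(rate=0.5)
    x = tf.ones((1, 8))

    print(drop(x, training=False))  # inference: input passes through unchanged (all ones)
    print(drop(x, training=True))   # training: roughly half the units zeroed, the rest scaled to 1 / (1 - 0.5) = 2.0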

On the difference between the example models

Though, these are things to try and check.
But I believe we generally start to fine-tune once the new upper layers have smoothed out enough to match the initial layers, to avoid a large flow in the forward and backward passes. So the logic stated for keeping it False in the 2019 example might not hold strongly every time.

Answered by 10xAI on September 4, 2021
