# Why are results without Transfer Learning better than with Transfer Learning?

Data Science Asked by Tobitor on December 23, 2020

I developed a neural network for license plate recognition and used the EfficientNet architecture (https://keras.io/api/applications/efficientnet/#efficientnetb0-function), with and without pretrained ImageNet weights and with and without data augmentation. I only had 10,000 training images and 3,000 validation images, which is why I applied transfer learning and image augmentation (AdditiveGaussianNoise).

I created this model:

```python
import efficientnet.keras as efn
from keras.layers import (Input, GlobalAveragePooling2D, Dropout, Dense,
                          BatchNormalization, Activation, Concatenate, Reshape)
from keras.models import Model

# Pretrained backbone, frozen so only the new head is trained
efnB0_model = efn.EfficientNetB0(include_top=False, weights="imagenet",
                                 input_shape=(224, 224, 3))
efnB0_model.trainable = False

def create_model(input_shape=(224, 224, 3)):
    input_img = Input(shape=input_shape)
    model = efnB0_model(input_img)
    model = GlobalAveragePooling2D(name='avg_pool')(model)
    model = Dropout(0.2)(model)
    backbone = model

    # One classification branch per character position (7 positions, 35 classes each)
    branches = []
    for i in range(7):
        branches.append(backbone)
        branches[i] = Dense(360, name="branch_" + str(i) + "_Dense_360")(branches[i])
        branches[i] = BatchNormalization()(branches[i])
        branches[i] = Activation("relu")(branches[i])
        branches[i] = Dropout(0.2)(branches[i])
        branches[i] = Dense(35, activation="softmax",
                            name="branch_" + str(i) + "_output")(branches[i])

    output = Concatenate(axis=1)(branches)
    output = Reshape((7, 35))(output)
    model = Model(input_img, output)

    return model
```


I compiled the model:

```python
opt = keras.optimizers.Adam(learning_rate=0.0001)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=["accuracy"])
```


And used this code to fit it:

```python
hist = model.fit(
    x=training_generator, epochs=10, verbose=1, callbacks=None,
    validation_data=validation_generator, steps_per_epoch=num_train_samples // 16,
    validation_steps=num_val_samples // 16,
    max_queue_size=10, workers=6, use_multiprocessing=True)
```


My hypotheses were:

H1: The EfficientNet architecture is applicable to license plate recognition.

H2: Transfer learning will improve accuracy in license plate recognition (compared to the situation without Transfer Learning).

H3: Image augmentation will improve accuracy in license plate recognition (compared to the situation without it).

H4: Transfer Learning combined with Image augmentation will bring the best results.

These are the results I got:

So, H1 seems to be correct, but H2, H3, and H4 seem to be wrong.

Thinking about it, I arrived at an explanation for H3 and H4 that seems logical to me: the image augmentation is too strong and degrades the image quality to a degree that makes it very hard for the network to recognize the characters.

1. Is this a suitable explanation, and are there additional ones?

It does seem to be the case that the image augmentation was too strong, so the first question is solved.
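If the augmentation is suspected of being too strong, its intensity can be tuned down and compared. A minimal numpy sketch of additive Gaussian noise (the `scale` values below are illustrative assumptions; imgaug's `AdditiveGaussianNoise` exposes the same knob):

```python
import numpy as np

def additive_gaussian_noise(image, scale=0.05, rng=None):
    """Add zero-mean Gaussian noise; `scale` is the std as a fraction of 255.

    A small scale (e.g. 0.01-0.05) perturbs pixels gently; large values
    (e.g. 0.3) can wash out thin strokes such as plate characters.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    noise = rng.normal(loc=0.0, scale=scale * 255.0, size=image.shape)
    noisy = image.astype(np.float64) + noise
    return np.clip(noisy, 0, 255).astype(np.uint8)

img = np.full((224, 224, 3), 128, dtype=np.uint8)  # flat gray stand-in image
mild = additive_gaussian_noise(img, scale=0.02)
heavy = additive_gaussian_noise(img, scale=0.30)
```

Comparing validation accuracy across a few such settings would show whether the noise level, rather than augmentation per se, is what hurt H3 and H4.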

Regarding H2 I am a little confused, to be honest. The network seems to overfit but stagnates completely in validation accuracy. So the conclusion that the ImageNet weights are not applicable does not seem logical to me, because the network did learn something on the training data. I also ruled out the possibility that the data volume is too small, since we got good recognition rates without using transfer learning or image augmentation…

2. Is there any logical explanation for this?

As @fuwiak mentioned, transfer learning may not work if the pre-trained model has been fitted on a "very different" dataset, typically when the pre-trained network extracts features that are not relevant to your problem.

Moreover, in the paper License Plate Recognition System Based on Transfer Learning (which you shared with me), the authors froze some layers of a pretrained Xception network (with ImageNet weights) to see the impact on training. They concluded that ImageNet data and license plate data are too different to freeze layers. So your results are confirmed.

Now, changing `efnB0_model.trainable = False` to `True` would allow the pre-trained network to be updated and become more relevant to your problem. Generally, if you have no time constraints, it seems to be always better (see this post). Will it give better results than initializing the weights randomly? One can presume so, but cannot know without trying.
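The usual pattern is to train the head with the backbone frozen first, then unfreeze and fine-tune the whole network with a lower learning rate (so the pretrained features are not destroyed by large early gradients). A minimal Keras sketch, using a tiny stand-in backbone rather than EfficientNetB0, with learning rates that are assumptions, not tuned values:

```python
import keras
from keras import layers

# Tiny stand-in for a pretrained backbone (e.g. EfficientNetB0)
backbone = keras.Sequential([
    keras.Input(shape=(8,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(8, activation="relu"),
], name="backbone")

# Phase 1: freeze the backbone, train only the new head
backbone.trainable = False
inputs = keras.Input(shape=(8,))
x = backbone(inputs)
outputs = layers.Dense(3, activation="softmax", name="head")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy")
# ... model.fit(...) with the frozen backbone ...

# Phase 2: unfreeze everything and fine-tune with a lower learning rate
backbone.trainable = True
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-5),
              loss="categorical_crossentropy")
# ... model.fit(...) again ...
```

Note that `trainable` must be set before `compile` for the change to take effect in that training phase.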

Correct answer by etiennedm on December 23, 2020

At least two possible issues:

Negative transfer

• Transfer learning works if the initial problem and ours are similar. Unfortunately, we may believe they are similar enough when it is just an illusion.

Data greed

• Often a model only starts working well once we provide much more data.
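One way to test the data-greed hypothesis is a learning curve: validation score versus training-set size. If the curve is still climbing at the full dataset size, more data would likely help. A sketch with scikit-learn on a stand-in classifier and synthetic data (both assumptions, not the original license plate setup):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in dataset; replace with your own features/labels
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Validation score at 5 increasing training-set sizes (3-fold CV)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=3)

mean_val = val_scores.mean(axis=1)
# A still-rising mean_val at the last entry suggests the model is data-greedy.
```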

Answered by fuwiak on December 23, 2020
