Why am I getting good validation scores, but poor test scores in Kaggle competition

Asked by Srinivas on Data Science, August 10, 2021

I am participating in a Kaggle multiclass classification competition, where submissions are scored on log loss. I am using Keras and scikit-learn with a deep neural network model, and have taken the approach below.

I corrected the class imbalance in the training data by oversampling the minority classes. I then split the training data into training (X_train, y_train) and validation (X_test, y_test) sets, scaled the features, and one-hot encoded the labels.
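Roughly, the oversampling and encoding steps look like this (a sketch rather than my verbatim code; I'm assuming imbalanced-learn's RandomOverSampler here, and X_raw/y_raw stand for the original features and string targets):

from imblearn.over_sampling import RandomOverSampler
from tensorflow.keras.utils import to_categorical

# duplicate minority-class rows until every class matches the majority class size
ros = RandomOverSampler(random_state=9)
X, y = ros.fit_resample(X_raw, y_raw)

# turn 'Class_1'..'Class_9' into integers 0..8, then one-hot encode to 9 columns
class_ids = y.str.replace('Class_', '').astype(int) - 1
y = to_categorical(class_ids, num_classes=9)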

When I run the model, I get a very good validation loss (1.708) and validation accuracy, good compared with the Kaggle leaderboard (the top logloss score there is 1.744). But when I submit my predicted class probabilities for the test set, I get an awfully high loss score (4+). (Separately, I got a decent score of 2.02 with a different model approach, which is reflected in the leaderboard.)

Why is this? Any suggestions on where I am going wrong or what I should do instead?

Class counts after oversampling (466299 rows in total):

Class_3    51811
Class_7    51811
Class_2    51811
Class_5    51811
Class_1    51811
Class_9    51811
Class_6    51811
Class_8    51811
Class_4    51811
Name: target, dtype: int64

from sklearn.model_selection import train_test_split as tts

X_train, X_test, y_train, y_test = tts(X, y, test_size=.3, stratify=y, random_state=9)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(326409, 75)
(326409, 9)
(139890, 75)
(139890, 9)

display(X_train.head(3))
display(X_test.head(3))
display(y_train[:3])
display(y_test[:3])

    feature_0   feature_1   feature_2   feature_3   feature_4   feature_5   feature_6   feature_7   feature_8   feature_9   ...     feature_65  feature_66  feature_67  feature_68  feature_69  feature_70  feature_71  feature_72  feature_73  feature_74
425643  0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   3   0   1   0   0   0
303754  2   3   2   2   5   0   0   1   1   1   ...     1   0   0   0   0   0   0   4   6   0
80710   2   8   2   0   18  2   0   2   1   3   ...     0   0   4   1   0   3   0   0   1   0

3 rows × 75 columns
    feature_0   feature_1   feature_2   feature_3   feature_4   feature_5   feature_6   feature_7   feature_8   feature_9   ...     feature_65  feature_66  feature_67  feature_68  feature_69  feature_70  feature_71  feature_72  feature_73  feature_74
300226  0   0   1   4   0   0   0   4   1   1   ...     1   0   1   0   0   1   0   0   2   2
124793  0   0   0   6   0   0   0   3   7   2   ...     0   0   0   0   0   0   0   0   0   0
439437  0   3   0   0   5   0   0   2   1   1   ...     2   0   0   0   3   0   4   0   0   0

3 rows × 75 columns

array([[0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)

array([[0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1.]], dtype=float32)

print(X_train.index.isin(X_test.index).sum())
print(X_test.index.isin(X_train.index).sum())
0
0
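(I realize this only compares index labels. Since oversampling assigns fresh, unique indices to duplicated rows, disjoint indices do not rule out the same row landing in both splits; a content-level check would be something like:)

import pandas as pd

# inner-join the deduplicated splits on all feature columns; any surviving rows
# are identical feature vectors present in both train and validation
overlap = pd.merge(X_train.drop_duplicates(), X_test.drop_duplicates(), how='inner')
print(len(overlap))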

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)
test_set = scaler.fit_transform(test_set)
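(For reference, the conventional scikit-learn pattern is to fit the scaler on the training split only and reuse its statistics for the other sets, rather than re-fitting on each split as above:)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test = scaler.transform(X_test)        # reuse the training statistics
test_set = scaler.transform(test_set)    # same statistics for the Kaggle test set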

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Dense(1024, input_shape=(75,), activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(9, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=.001), metrics=['accuracy'])

from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(monitor='val_loss', patience=5)
model.fit(X_train, y_train, epochs=50, validation_split=.3, callbacks=[early_stopping], batch_size=1024)
accuracy = model.evaluate(X_test, y_test)[1]
print('Accuracy:', accuracy)

............
Epoch 28/30
45/45 [==============================] - 5s 117ms/step - loss: 1.6676 - accuracy: 0.3626 - val_loss: 1.7675 - val_accuracy: 0.3333
Epoch 29/30
45/45 [==============================] - 5s 114ms/step - loss: 1.6140 - accuracy: 0.3809 - val_loss: 1.7815 - val_accuracy: 0.3357
Epoch 30/30
45/45 [==============================] - 5s 117ms/step - loss: 1.5942 - accuracy: 0.3869 - val_loss: 1.7126 - val_accuracy: 0.3563
4372/4372 [==============================] - 11s 2ms/step - loss: 1.7085 - accuracy: 0.3582
Accuracy: 0.3581957221031189
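(One Keras detail: without restore_best_weights=True, EarlyStopping leaves the model holding the weights from the final epoch rather than from the best-val_loss epoch, so evaluate and predict may not use the best checkpoint:)

early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)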

from sklearn.metrics import log_loss

preds_val = model.predict(X_test)

preds_val[:3]
array([[1.13723904e-01, 5.20741269e-02, 4.70720865e-02, 1.59640312e-02,
        1.92086305e-02, 2.25828230e-01, 1.81854114e-01, 1.99746847e-01,
        1.44528091e-01],
       [6.04994688e-03, 1.40825182e-01, 9.95656699e-02, 5.96038415e-04,
        5.59030111e-09, 4.57442701e-02, 3.05081338e-01, 1.77178025e-01,
        2.24959582e-01],
       [6.54266328e-02, 9.87399742e-02, 1.07230745e-01, 1.46904245e-01,
        6.80148089e-03, 1.52257413e-01, 1.22348621e-01, 1.58026025e-01,
        1.42264828e-01]], dtype=float32)

log_loss(y_test, preds_val)
1.708450169537806
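For completeness, the submitted probabilities are produced roughly like this (a sketch; the id column and exact header layout follow the competition's sample_submission file, and the Class_1..Class_9 names are taken from the target labels above):

import pandas as pd

preds_test = model.predict(test_set)  # class probabilities for the Kaggle test set
submission = pd.DataFrame(preds_test, columns=[f'Class_{i}' for i in range(1, 10)])
submission.to_csv('submission.csv', index=False)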
