
Need help on Deep learning model on sentiment analysis

Data Science – Asked by Vidhya Shankar on August 1, 2021

Before I present my problem, please note that I am a newbie in deep learning and I am trying things for the first time. Most of my code/logic was adapted from various references on the internet.

Goal: Build an LSTM/CNN model to classify the IMDB reviews available in TensorFlow Datasets.

Approach 1: LSTM-based – train data: 45,000 (10% validation split), test data: 5,000; accuracy > 95%, validation accuracy > 85%; GloVe embeddings of size 100 were used.
Approach 2: CNN model – a) train data: 45,000, test data: 5,000; b) train data: 50%, test data: 50%; accuracy > 95%, validation accuracy > 85%.

Code : https://github.com/shankartmv/Deep-Learning-Work/blob/main/IMDB_Sentiment_reviews_using_tensorflow_dataset.ipynb
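For context, a minimal sketch of how a 45,000/5,000 split might be carved out of the 50,000 labelled reviews in tensorflow_datasets (the shuffling, seed and variable names here are assumptions, not the code from the notebook linked above):

import tensorflow_datasets as tfds

# Load all 50,000 labelled IMDB reviews (the original 25k train + 25k test).
full_ds = tfds.load('imdb_reviews', split='train+test', as_supervised=True)

# Re-split into 45,000 training and 5,000 test examples (assumed approach).
full_ds = full_ds.shuffle(50_000, seed=42)
test_ds = full_ds.take(5_000)
train_ds = full_ds.skip(5_000)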

Problem: Test data accuracy doesn't go beyond 52% with either approach. Most of the code/references available out there use test_data during training; test_data wasn't part of my training.

Methodologies tried to increase test accuracy :

  1. Movie review length (padding) and maximum number of words in the vocabulary
  2. Dropout
  3. Number of epochs
  4. Train/test split ratio
  5. Embeddings with trainable=True/False
  6. With and without GloVe word embeddings

My guess is there isn’t enough training data. I need help on how to increase the test data accuracy.

One Answer

Most of the code/references available out there use test_data during training. test_data wasn't part of my training.

While this is the right way to do it, steps like encoding must still be done holistically.

In your case, you have called pre_process separately for the test and train data, so the words are converted to numbers independently. This should not happen.

tokenizer.texts_to_sequences(test)
The tokenizer above should be the one that was fit on the train data.
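In other words, fit a single tokenizer on the training texts and reuse it to encode the test texts. A minimal sketch of that pattern (train_texts, test_texts, max_words and max_len are placeholder names, not the ones from the notebook; train_texts/test_texts are assumed to be lists of raw review strings):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_words = 10000   # assumed vocabulary size
max_len = 200       # assumed padded review length

# Fit the tokenizer on the TRAINING texts only.
tokenizer = Tokenizer(num_words=max_words, oov_token='<OOV>')
tokenizer.fit_on_texts(train_texts)

# Reuse the SAME tokenizer for both splits, so a given word maps to the
# same integer index in train and in test.
x_train = pad_sequences(tokenizer.texts_to_sequences(train_texts), maxlen=max_len)
x_test  = pad_sequences(tokenizer.texts_to_sequences(test_texts), maxlen=max_len)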

If I print the token with key 101 from the train and test tokenizers, this is the result:

print(train_tokn.index_word[101])
print(test_tokn.index_word[101])

think
characters

I think you should use train_tokn for the test data, and the test accuracy should improve. I believe a very simple LSTM can achieve 85% on this dataset.
Or, manually embed both the train and test data using the GloVe embeddings.
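If you go the GloVe route, the usual pattern is to build one embedding matrix keyed to the training tokenizer's word index and pass it to the Embedding layer, so both splits share the same vectors. A rough sketch, assuming the 100-dimensional glove.6B.100d.txt file and a tokenizer fitted on the train data as above (the file path and max_words are assumptions):

import numpy as np
from tensorflow.keras.layers import Embedding

embedding_dim = 100

# Load the pre-trained GloVe vectors into a dict (file path is an assumption).
glove = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        glove[values[0]] = np.asarray(values[1:], dtype='float32')

# One row per word in the TRAIN tokenizer's vocabulary; words without a
# GloVe vector keep a zero row.
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in tokenizer.word_index.items():
    if i < max_words and word in glove:
        embedding_matrix[i] = glove[word]

embedding_layer = Embedding(max_words, embedding_dim,
                            weights=[embedding_matrix],
                            trainable=False)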


A simple example of the issue

from keras.preprocessing.text import Tokenizer

train = ['I am sorry'] 
test = ['I am very sorry']
max_words = 10 

# Train
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(train)
tokenizer.index_word # {1: 'i', 2: 'am', 3: 'sorry'}
# Test
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(test)
tokenizer.index_word  # {1: 'i', 2: 'am', 3: 'very', 4: 'sorry'}
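
For contrast, fitting the tokenizer only on the training text and reusing it for the test text keeps the indices consistent. A sketch of the fix (the oov_token is an addition here, so that unseen test words are handled explicitly):

# Fit once, on the training text only.
tokenizer = Tokenizer(num_words=max_words, oov_token='<OOV>')
tokenizer.fit_on_texts(train)
tokenizer.index_word                # {1: '<OOV>', 2: 'i', 3: 'am', 4: 'sorry'}

# Reuse the same tokenizer for the test text; the unseen word 'very' maps
# to the <OOV> index instead of shifting every other word's index.
tokenizer.texts_to_sequences(test)  # [[2, 3, 1, 4]]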

Correct answer by 10xAI on August 1, 2021
