
Need help on Deep learning model on sentiment analysis

Data Science – Asked by Vidhya Shankar on August 1, 2021

Before I present my problem, please note that I am a newbie in deep learning and I am trying things for the first time. Most of my code/logic was adapted from various references on the internet.

Goal: Build an LSTM/CNN model to classify the IMDB reviews available in TensorFlow Datasets.

Approach 1: LSTM-based – train data: 45,000 (10% validation split), test data: 5,000; accuracy > 95%, validation accuracy > 85%; GloVe embeddings of size 100 were used.
Approach 2: CNN model – a) train data: 45,000, test data: 5,000; b) train data: 50%, test data: 50%; accuracy > 95%, validation accuracy > 85%.

Code : https://github.com/shankartmv/Deep-Learning-Work/blob/main/IMDB_Sentiment_reviews_using_tensorflow_dataset.ipynb
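For context, a minimal sketch of how a 45,000/5,000 split might be carved out of the 50,000 labelled reviews in tensorflow_datasets (the shuffling, seed and variable names here are assumptions, not the code from the notebook linked above):

import tensorflow_datasets as tfds

# Load all 50,000 labelled IMDB reviews (the original 25k train + 25k test).
full_ds = tfds.load('imdb_reviews', split='train+test', as_supervised=True)

# Re-split into 45,000 training and 5,000 test examples (assumed approach).
full_ds = full_ds.shuffle(50_000, seed=42)
test_ds = full_ds.take(5_000)
train_ds = full_ds.skip(5_000)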

Problem: Test data accuracy doesn't go beyond 52% with either approach. Most of the code/references available out there use test_data during training; test_data wasn't part of my training.

Methodologies tried to increase test accuracy :

  1. Movie review length (padding) and maximum number of words in the vocabulary
  2. Dropout
  3. Number of epochs
  4. Train/test split ratio
  5. Embeddings with trainable=True/False
  6. With and without GloVe word embeddings

My guess is there isn’t enough training data. I need help on how to increase the test data accuracy.

One Answer

Most of the code/references available out there use test_data during training. test_data wasn't part of my training.

While this is the right way to do it, steps like encoding must still be done holistically.

In your case, you have called pre_process separately for the test and train data, so the words are converted to numbers independently. This should not happen.

tokenizer.texts_to_sequences(test)
The tokenizer above should be the one that was fit on the train data.
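In other words, fit a single tokenizer on the training texts and reuse it to encode the test texts. A minimal sketch of that pattern (train_texts, test_texts, max_words and max_len are placeholder names, not the ones from the notebook; train_texts/test_texts are assumed to be lists of raw review strings):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_words = 10000   # assumed vocabulary size
max_len = 200       # assumed padded review length

# Fit the tokenizer on the TRAINING texts only.
tokenizer = Tokenizer(num_words=max_words, oov_token='<OOV>')
tokenizer.fit_on_texts(train_texts)

# Reuse the SAME tokenizer for both splits, so a given word maps to the
# same integer index in train and in test.
x_train = pad_sequences(tokenizer.texts_to_sequences(train_texts), maxlen=max_len)
x_test  = pad_sequences(tokenizer.texts_to_sequences(test_texts), maxlen=max_len)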

If I print the token with key 101 from the train and test tokenizers, this is the result:

print(train_tokn.index_word[101])
print(test_tokn.index_word[101])

think
characters

I think you should use train_tokn for the test data, and the test accuracy should improve. I believe a very simple LSTM can achieve 85% on this dataset.
Or, manually embed both the train and test data using the GloVe embeddings.
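If you go the GloVe route, the usual pattern is to build one embedding matrix keyed to the training tokenizer's word index and pass it to the Embedding layer, so both splits share the same vectors. A rough sketch, assuming the 100-dimensional glove.6B.100d.txt file and a tokenizer fitted on the train data as above (the file path and max_words are assumptions):

import numpy as np
from tensorflow.keras.layers import Embedding

embedding_dim = 100

# Load the pre-trained GloVe vectors into a dict (file path is an assumption).
glove = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        glove[values[0]] = np.asarray(values[1:], dtype='float32')

# One row per word in the TRAIN tokenizer's vocabulary; words without a
# GloVe vector keep a zero row.
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in tokenizer.word_index.items():
    if i < max_words and word in glove:
        embedding_matrix[i] = glove[word]

embedding_layer = Embedding(max_words, embedding_dim,
                            weights=[embedding_matrix],
                            trainable=False)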


A simple example of the issue

from keras.preprocessing.text import Tokenizer

train = ['I am sorry'] 
test = ['I am very sorry']
max_words = 10 

# Train
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(train)
tokenizer.index_word # {1: 'i', 2: 'am', 3: 'sorry'}
# Test
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(test)
tokenizer.index_word  # {1: 'i', 2: 'am', 3: 'very', 4: 'sorry'}
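
For contrast, fitting the tokenizer only on the training text and reusing it for the test text keeps the indices consistent. A sketch of the fix (the oov_token is an addition here, so that unseen test words are handled explicitly):

# Fit once, on the training text only.
tokenizer = Tokenizer(num_words=max_words, oov_token='<OOV>')
tokenizer.fit_on_texts(train)
tokenizer.index_word                # {1: '<OOV>', 2: 'i', 3: 'am', 4: 'sorry'}

# Reuse the same tokenizer for the test text; the unseen word 'very' maps
# to the <OOV> index instead of shifting every other word's index.
tokenizer.texts_to_sequences(test)  # [[2, 3, 1, 4]]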

Correct answer by 10xAI on August 1, 2021
