
Multilabel Tweet Classification

Cross Validated, asked by Vineet on November 12, 2021

I need some general advice and possible ideas.

The problem statement is as follows: given a tweet, we have to assign the labels associated with it, such as generalized hate, support, oppose, refutation, allegation, and sarcasm.

The training data is ~6k tweets. However, there is very high class imbalance: roughly 90% of the label values are 0 and the rest are 1.

The approaches I have tried:

  • Classical BoW with generic tweet preprocessing, trained with SVM, XGBoost, Naive Bayes, etc. Accuracy is good (due to the class imbalance), but AUC is very poor, ~0.52 (a minimal version of this baseline is sketched after this list).
  • More sophisticated techniques: LSTM and GRU with GloVe embeddings, and BERT. These perform even worse, with AUC < 0.49; in fact, the best of these classifiers predicts the same label for all test data. (I tried minority oversampling too, but it did not improve performance either.)
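For reference, here is a minimal sketch of the kind of BoW baseline described in the first bullet, with per-label class weighting to address the imbalance. The file name, text column, and label column names are assumptions for illustration; the actual preprocessing and models may differ.

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC
    from sklearn.metrics import roc_auc_score

    # Hypothetical input file: one text column plus one 0/1 column per label.
    df = pd.read_csv("tweets_labelled.csv")
    label_cols = ["generalized_hate", "support", "oppose",
                  "refutation", "allegation", "sarcasm"]
    texts, Y = df["text"].values, df[label_cols].values

    X_train, X_test, Y_train, Y_test = train_test_split(
        texts, Y, test_size=0.2, random_state=42)

    # Word + bigram TF-IDF features over the (already preprocessed) tweets.
    vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2, sublinear_tf=True)
    Xtr = vec.fit_transform(X_train)
    Xte = vec.transform(X_test)

    # One binary SVM per label; class_weight='balanced' upweights the rare positives.
    clf = OneVsRestClassifier(LinearSVC(class_weight="balanced", C=1.0))
    clf.fit(Xtr, Y_train)

    # Margin scores give a per-label ranking, so ROC AUC is well defined.
    scores = clf.decision_function(Xte)
    print("macro ROC AUC:", roc_auc_score(Y_test, scores, average="macro"))
    # Note: roc_auc_score raises an error for any label whose test fold
    # contains only one class, which can happen with very rare labels.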

What I figured out is that the BERT vocabulary is not recognizing most of the words and is mapping them to zero.
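One way to verify this is to measure how often the BERT tokenizer falls back to the unknown token on the tweets. The sketch below assumes the HuggingFace transformers library and the bert-base-uncased checkpoint; the sample tweets are placeholders.

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    tweets = ["Sooo much #hate in this thread smh",
              "totally agree with u, 100% support!!"]  # placeholder tweets

    unk_id = tok.unk_token_id
    n_unk, n_total = 0, 0
    for t in tweets:
        ids = tok(t, add_special_tokens=False)["input_ids"]
        n_unk += sum(i == unk_id for i in ids)
        n_total += len(ids)

    print(f"[UNK] rate: {n_unk / n_total:.2%}")
    # WordPiece normally splits unseen words into subwords rather than mapping
    # them to [UNK], so a high [UNK] rate (or many ids equal to 0, the [PAD] id)
    # usually points to a preprocessing/tokenizer mismatch rather than a
    # fundamental vocabulary limit.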

What other approaches should I try? Any leads are appreciated.
