
Multilabel Tweet Classification

Cross Validated, asked by Vineet on November 12, 2021

I need some general advice and possible ideas.

The problem statement is as follows: given a tweet, we have to assign the labels associated with it, such as generalized hate, support, oppose, refutation, allegation, and sarcasm.

The training data is ~6k tweets. However, there is very high class imbalance: roughly 90% of the label values are 0 and the rest are 1.

The approaches I have tried:

  • Classical BoW with generic tweet preprocessing, trained with SVM, XGBoost, Naive Bayes, etc. Accuracy is good (due to the class imbalance), but AUC is very poor, ~0.52 (a minimal version of this baseline is sketched after this list).
  • More sophisticated techniques: LSTM and GRU with GloVe embeddings, and BERT. These perform even worse, with AUC < 0.49; in fact, the best of these classifiers predicts the same label for all test data. (I tried minority oversampling too, but it did not improve performance either.)
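For reference, here is a minimal sketch of the kind of BoW baseline described in the first bullet, with per-label class weighting to address the imbalance. The file name, text column, and label column names are assumptions for illustration; the actual preprocessing and models may differ.

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC
    from sklearn.metrics import roc_auc_score

    # Hypothetical input file: one text column plus one 0/1 column per label.
    df = pd.read_csv("tweets_labelled.csv")
    label_cols = ["generalized_hate", "support", "oppose",
                  "refutation", "allegation", "sarcasm"]
    texts, Y = df["text"].values, df[label_cols].values

    X_train, X_test, Y_train, Y_test = train_test_split(
        texts, Y, test_size=0.2, random_state=42)

    # Word + bigram TF-IDF features over the (already preprocessed) tweets.
    vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2, sublinear_tf=True)
    Xtr = vec.fit_transform(X_train)
    Xte = vec.transform(X_test)

    # One binary SVM per label; class_weight='balanced' upweights the rare positives.
    clf = OneVsRestClassifier(LinearSVC(class_weight="balanced", C=1.0))
    clf.fit(Xtr, Y_train)

    # Margin scores give a per-label ranking, so ROC AUC is well defined.
    scores = clf.decision_function(Xte)
    print("macro ROC AUC:", roc_auc_score(Y_test, scores, average="macro"))
    # Note: roc_auc_score raises an error for any label whose test fold
    # contains only one class, which can happen with very rare labels.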

What I figured out is that the BERT vocabulary is not recognizing most of the words and is mapping them to zero.
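One way to verify this is to measure how often the BERT tokenizer falls back to the unknown token on the tweets. The sketch below assumes the HuggingFace transformers library and the bert-base-uncased checkpoint; the sample tweets are placeholders.

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    tweets = ["Sooo much #hate in this thread smh",
              "totally agree with u, 100% support!!"]  # placeholder tweets

    unk_id = tok.unk_token_id
    n_unk, n_total = 0, 0
    for t in tweets:
        ids = tok(t, add_special_tokens=False)["input_ids"]
        n_unk += sum(i == unk_id for i in ids)
        n_total += len(ids)

    print(f"[UNK] rate: {n_unk / n_total:.2%}")
    # WordPiece normally splits unseen words into subwords rather than mapping
    # them to [UNK], so a high [UNK] rate (or many ids equal to 0, the [PAD] id)
    # usually points to a preprocessing/tokenizer mismatch rather than a
    # fundamental vocabulary limit.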

What other approaches should I try? Any leads are appreciated.
