What classifier could predict spam/ham labels for SMS messages better than Naive Bayes?

Cross Validated · Asked on December 15, 2021

I have 7000 SMS messages, 6000 ham, 1000 spam. Typical messages are:

Ham: Yo, any way we could pick something up tonight?
Spam: Great News! Call FREEFONE 08006344447 to claim your guaranteed £1000 CASH or £2000 gift.

I want to implement a supervised classifier that would predict the ham/spam label given a new SMS.

The two classifiers I have tried are as follows:

  • Simple-predictor, where I count how many of the following keywords

     [
        "!", "click", "visit", "reply", "subscribe", "free", "price", "offer",
        "claim code", "charge", "stop", "unlimited", "expires", "£",
        "new voicemail", "cash prize", "special-call"
     ]
    

    are substrings of the (lowercased) SMS message, and predict spam if the count is greater than 1, ham otherwise (a sketch of both predictors follows this list). The method achieves

    accuracy (correct guesses ratio): 0.9742822966507177
    sensitivity (correct spam guesses ratio): 0.8452380952380952
    
  • Bayes (unigram) predictor, where I split the SMS into a token list $L = [t_1, t_2, \ldots, t_n]$ (e.g. for the ham message above, $L$ would be ['yo', 'any', 'way', ..., 'tonight']) and compare the quantities:

    • $s = P(\text{spam}) \cdot P(t_1 \mid \text{spam}) \cdot \ldots \cdot P(t_n \mid \text{spam})$,

    • $h = P(\text{ham}) \cdot P(t_1 \mid \text{ham}) \cdot \ldots \cdot P(t_n \mid \text{ham})$,

    and predict spam if $s > h$, ham otherwise.

    $P(\text{spam})$, $P(\text{ham})$, $P(\text{token} \mid \text{spam})$ and $P(\text{token} \mid \text{ham})$ are estimated from the training data.

    This method achieves

      accuracy: 0.9881889763779528
      sensitivity: 0.9312977099236641
    

    when trained on 4000 messages and tested on the remaining 3000.
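For reference, here is a minimal Python sketch of both predictors (simplified, with placeholder names such as train_msgs/train_labels; the Laplace smoothing and log-space scoring in the Bayes step are standard additions to avoid zero probabilities and underflow, not part of the plain product above):

    from collections import Counter
    import math

    KEYWORDS = ["!", "click", "visit", "reply", "subscribe", "free", "price",
                "offer", "claim code", "charge", "stop", "unlimited",
                "expires", "£", "new voicemail", "cash prize", "special-call"]

    def simple_predict(sms):
        # Spam if more than one keyword occurs as a substring.
        text = sms.lower()
        return "spam" if sum(kw in text for kw in KEYWORDS) > 1 else "ham"

    def train_naive_bayes(msgs, labels, alpha=1.0):
        # Per-class token counts and class priors from the training data.
        counts = {"spam": Counter(), "ham": Counter()}
        priors = Counter(labels)
        for sms, y in zip(msgs, labels):
            counts[y].update(sms.lower().split())
        vocab = set(counts["spam"]) | set(counts["ham"])

        def predict(sms):
            scores = {}
            for y in ("spam", "ham"):
                total = sum(counts[y].values())
                # Log space avoids underflow; alpha avoids zero probabilities.
                score = math.log(priors[y] / len(msgs))
                for t in sms.lower().split():
                    score += math.log((counts[y][t] + alpha)
                                      / (total + alpha * len(vocab)))
                scores[y] = score
            # Predict the class with the higher (log) score.
            return max(scores, key=scores.get)

        return predict

After predict = train_naive_bayes(train_msgs, train_labels), calling predict(sms) returns 'spam' or 'ham'.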

What new idea could I try to obtain a classifier with better prediction scores?

Note that I have already tried tuning both the Simple-predictor (e.g., trying different keyword lists, changing the count threshold) and the Bayes predictor (e.g., a bigram predictor performs worse due to the limited training set size) to achieve these scores. Now I am looking for a new idea.

One Answer

Basically any text classification method can be applied here. If you want to stick with classical ML methods, you can try:

  • A different model (logistic regression, SVM),

  • Feature engineering (e.g., replacing all phone numbers with a special token, or removing stop words); for discriminative models, you can also weight the input with TF-IDF scores and include n-gram features (see the sketch after this list),

  • Word embeddings (such as GloVe or FastText) as input to a discriminative model.
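A minimal scikit-learn sketch of the first two suggestions, assuming the data already sits in train_msgs/train_labels and test_msgs/test_labels (placeholder names; the phone-number regex and hyperparameters are illustrative):

    import re
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report

    def normalize(sms):
        # Map phone numbers to one special token so they share statistics.
        return re.sub(r"\d{5,}", "<phone>", sms.lower())

    clf = Pipeline([
        ("tfidf", TfidfVectorizer(preprocessor=normalize,
                                  ngram_range=(1, 2),  # word uni- and bigrams
                                  min_df=2)),          # drop one-off features
        ("lr", LogisticRegression(max_iter=1000,
                                  class_weight="balanced")),  # 6:1 imbalance
    ])
    clf.fit(train_msgs, train_labels)
    print(classification_report(test_labels, clf.predict(test_msgs)))

class_weight="balanced" is worth trying because the dataset is six times more ham than spam, which otherwise biases the decision threshold toward ham.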

If you do not care about the inference time, you can try some neural models. 7k messages should be enough to train a small LSTM classifier and definitely enough to fine-tune BERT or RoBERTa.
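And a rough sketch of the fine-tuning route with the Hugging Face transformers library (the model choice, hyperparameters, and placeholder names like train_msgs/train_y are illustrative; labels encoded as 0 = ham, 1 = spam):

    import torch
    from torch.utils.data import Dataset
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)

    class SMSDataset(Dataset):
        def __init__(self, msgs, labels):  # labels: 0 = ham, 1 = spam
            self.enc = tok(msgs, truncation=True, padding=True)
            self.labels = labels

        def __len__(self):
            return len(self.labels)

        def __getitem__(self, i):
            item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
            item["labels"] = torch.tensor(self.labels[i])
            return item

    args = TrainingArguments(output_dir="sms-bert", num_train_epochs=3,
                             per_device_train_batch_size=16)
    Trainer(model=model, args=args,
            train_dataset=SMSDataset(train_msgs, train_y)).train()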

Answered by Jindřich on December 15, 2021
