Anomaly detection in Text Classification

Question

I have built a text classifier using OneClassSVM.

My training set contains only one label ("Yes"); I have no data for the other label ("No").
My task is to build a classifier that labels a new, unseen sentence (test data) as 1 if it is very similar to the training data, and as -1 (an anomaly) otherwise.

I have used Word2Vec to build the word embeddings for my training data.
Then I use word-vector averaging with OneClassSVM to build an anomaly detector.
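For reference, the pipeline described above can be sketched roughly as follows. This is only a minimal illustration: `toy_vectors` is a hypothetical hand-made dictionary standing in for the trained Word2Vec model, and the sentences, vector dimension, and `nu`/`gamma` values are assumptions, not the actual setup.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Hypothetical toy embeddings standing in for a trained Word2Vec model.
toy_vectors = {
    "refund": np.array([0.9, 0.1, 0.0]),
    "payment": np.array([0.8, 0.2, 0.1]),
    "invoice": np.array([0.85, 0.15, 0.05]),
    "weather": np.array([0.0, 0.1, 0.9]),
    "sunny": np.array([0.1, 0.0, 0.95]),
}


def average_vector(tokens, vectors, dim=3):
    """Mean of the vectors of known tokens; zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)


# Tokenized "Yes" sentences only (illustrative data).
train_sentences = [
    ["refund", "payment"],
    ["invoice", "payment"],
    ["refund", "invoice"],
    ["payment"],
]
X_train = np.vstack([average_vector(s, toy_vectors) for s in train_sentences])

# nu is an upper bound on the fraction of training points treated as outliers.
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X_train)

new_sentences = [["refund", "invoice", "payment"], ["weather", "sunny"]]
X_new = np.vstack([average_vector(s, toy_vectors) for s in new_sentences])
predictions = detector.predict(X_new)  # 1 = similar to training data, -1 = anomaly
```

Words missing from the embedding vocabulary are simply skipped here, so a sentence made only of unknown words collapses to the zero vector; how such sentences should be handled is a design choice worth checking.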

This classifier currently gives an accuracy of about 50%-55%. I need to improve it to get a robust classifier.

Any suggestions would be helpful.

Akbari · Answer

The paper "Outlier Detection for Text Data" discusses a similar problem. I believe that for a robust classifier you need to capture the latent topics in the corpus, either with an LSI approach as discussed in that paper or via clustering in a latent space. I think using a denoising autoencoder to learn features from the sentence embeddings is the most straightforward way to obtain a robust classifier.
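As a rough illustration of the LSI idea (not the method from the paper itself), one could project TF-IDF vectors into a low-dimensional latent space with truncated SVD and fit the one-class model there. The texts, the number of components, and the `nu` value below are all assumptions chosen for a toy example.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.svm import OneClassSVM

# Illustrative "Yes" examples only; real data would replace these.
train_texts = [
    "please refund my payment",
    "the invoice shows a wrong payment",
    "refund the order and cancel the invoice",
    "payment failed for my order",
    "I need an invoice for the refund",
    "cancel my order and refund the payment",
]

# TF-IDF followed by truncated SVD is the usual way to do LSI in scikit-learn.
# n_components=2 is only suitable for this toy corpus.
lsi_detector = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
    Normalizer(copy=False),
    OneClassSVM(kernel="rbf", gamma="scale", nu=0.1),
)
lsi_detector.fit(train_texts)

# predict returns 1 for inliers and -1 for anomalies.
labels = lsi_detector.predict(
    ["refund for my order", "the weather is sunny today"]
)
```

The number of latent components is the main knob to tune; on a real corpus it would typically be much larger than 2.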

Answered by Akbari on February 9, 2021

MkL · Answer

The question is then about your data: how representative of the whole "Yes" class are the cases in your training set?

And what types of errors does your classifier make?
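To see which kind of error dominates, you can split the mistakes into false alarms ("Yes" sentences flagged as anomalies) and missed anomalies. A minimal sketch, assuming you have a small hand-labeled validation set that includes a few "No" examples; the function name `error_breakdown` and the sample labels are made up for illustration.

```python
from collections import Counter


def error_breakdown(y_true, y_pred):
    """Count outcomes using the 1 (inlier) / -1 (anomaly) convention."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["correct_inlier"] += 1
        elif truth == 1 and pred == -1:
            counts["false_alarm"] += 1
        elif truth == -1 and pred == -1:
            counts["correct_anomaly"] += 1
        else:
            counts["missed_anomaly"] += 1
    return counts


# Made-up labels, for illustration only.
y_true = [1, 1, 1, -1, -1, -1]
y_pred = [1, -1, -1, -1, 1, -1]
breakdown = error_breakdown(y_true, y_pred)
```

Many false alarms would suggest the training set does not cover the "Yes" class well (or `nu` is too high), while many missed anomalies would suggest the features do not separate the two classes.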

You may also try to use word2vec to produce embeddings of the whole texts.
