Correct algorithm for string classification

Question

I have a long list of DNA strings (all of equal length) made of 4 letters (A, T, G, C). I want to perform binary classification on these strings. I have two basic questions:

I have a lot of duplicate strings in my dataset. Should I keep them while training?  
What is the usual machine learning / deep learning approach to problems like this?

The dataset looks like the following:

String             Class
ATTGCCCGCGCGCCG    1
AGGCGCGCAGCAGCA    2
GCGCGCAGCAGGACA    1

I have tried splitting each string into overlapping substrings (k-mers) of length 3, 4, and 5, and then using TF-IDF or CountVectorizer to build a vector representation. Finally, I trained a classifier on these vectors and evaluated it, but the accuracy won't go above 63%.
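For reference, the overlapping-substring step described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the exact code I ran: the helper names `kmer_counts` and `vectorize` are made up for this sketch, and the three sequences are the sample rows shown earlier. scikit-learn's `CountVectorizer` with `analyzer="char"` and `ngram_range=(3, 5)` should produce comparable counts.

```python
from collections import Counter

def kmer_counts(seq, sizes=(3, 4, 5)):
    """Count every overlapping substring (k-mer) of the given sizes in one sequence."""
    counts = Counter()
    for k in sizes:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def vectorize(seqs, sizes=(3, 4, 5)):
    """Turn sequences into count vectors over a shared, sorted k-mer vocabulary."""
    per_seq = [kmer_counts(s, sizes) for s in seqs]
    vocab = sorted(set().union(*per_seq))
    # Counter returns 0 for k-mers that do not occur in a sequence.
    return vocab, [[c[kmer] for kmer in vocab] for c in per_seq]

# Sample rows from the question (illustrative only).
seqs = ["ATTGCCCGCGCGCCG", "AGGCGCGCAGCAGCA", "GCGCGCAGCAGGACA"]
vocab, X = vectorize(seqs)
```

Each row of `X` can then be passed to any standard classifier. With short sequences and k up to 5, most columns are zero, which may be part of why a linear model struggles here.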

yiping · Answer

For sequence data, an LSTM is a common default model. It can model long sequences and has much more representational power than linear models. Take a look at PyTorch's sequence models tutorial if you're new to it.

https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html#sphx-glr-beginner-nlp-sequence-models-tutorial-py
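As a rough starting point, an LSTM classifier for fixed-length DNA strings might look like the sketch below. It is only an illustration under assumptions: the class name `DnaLstm`, the layer sizes, the learning rate, and the number of training steps are arbitrary choices, and the three sequences are the sample rows from the question with classes 1/2 shifted to 0/1.

```python
import torch
import torch.nn as nn

# Integer index for each nucleotide.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(seqs):
    """Map equal-length DNA strings to a (batch, length) tensor of indices."""
    return torch.tensor([[VOCAB[ch] for ch in s] for s in seqs], dtype=torch.long)

class DnaLstm(nn.Module):
    """Embedding -> LSTM -> linear layer on the final hidden state."""

    def __init__(self, embed_dim=8, hidden_dim=32, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(len(VOCAB), embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # h_n has shape (num_layers, batch, hidden_dim); use the last layer.
        _, (h_n, _) = self.lstm(self.embed(x))
        return self.out(h_n[-1])

# Sample rows from the question; labels 1/2 are shifted to 0/1.
seqs = ["ATTGCCCGCGCGCCG", "AGGCGCGCAGCAGCA", "GCGCGCAGCAGGACA"]
labels = torch.tensor([0, 1, 0])

model = DnaLstm()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

x = encode(seqs)
for _ in range(5):  # a few steps only, to show the training loop
    optimizer.zero_grad()
    loss = loss_fn(model(x), labels)
    loss.backward()
    optimizer.step()

logits = model(x)  # shape: (batch, num_classes)
```

On real data you would train in mini-batches, hold out a validation set, and tune the sizes above.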

If I have a large enough dataset, I usually don't bother to remove the duplicates.
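If you want to see how many duplicates you actually have before deciding, something like the sketch below may help. It is only an illustration: `duplicate_report` is a hypothetical helper, and the toy rows are made up. It also lists sequences that appear with more than one class, which you may want to look at separately.

```python
from collections import Counter, defaultdict

def duplicate_report(seqs, labels):
    """Return (sequences seen more than once, sequences with conflicting labels)."""
    counts = Counter(seqs)
    labels_by_seq = defaultdict(set)
    for s, y in zip(seqs, labels):
        labels_by_seq[s].add(y)
    duplicated = {s: n for s, n in counts.items() if n > 1}
    conflicting = sorted(s for s, ys in labels_by_seq.items() if len(ys) > 1)
    return duplicated, conflicting

# Toy data (hypothetical): the second sequence appears with two different classes.
seqs = ["ATTGCCCGCGCGCCG", "AGGCGCGCAGCAGCA", "ATTGCCCGCGCGCCG", "AGGCGCGCAGCAGCA"]
labels = [1, 2, 1, 1]

duplicated, conflicting = duplicate_report(seqs, labels)
```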
