TransWikia.com

Correct algorithm for string classification

Cross Validated Asked by bandit_king28 on December 14, 2020

I have a long list of DNA strings (of equal length) made of 4 letters (A, T, G, C). I want to do binary classification on the strings. I have two basic questions:

  1. I have a lot of duplicate strings in my dataset. Should I keep them while training?
  2. Usually, what is the correct machine learning / deep learning approach to problems like these?

The dataset looks like the following:

String             Class
ATTGCCCGCGCGCCG    1
AGGCGCGCAGCAGCA    2
GCGCGCAGCAGGACA    1

I have tried splitting each string into overlapping substrings of length 3, 4, and 5, and then using TF-IDF or CountVectorizer to obtain vector representations. Finally, I trained a classifier on these vectors and evaluated it, but the accuracy won't go above 63%.
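
For reference, the overlapping-substring step can be sketched in plain Python roughly as follows. This is only an illustration of the idea, not the exact pipeline used: the function names are made up for this example, and the three strings are the sample rows shown above. In scikit-learn, `CountVectorizer(analyzer="char", ngram_range=(3, 5))` should give comparable character n-gram counts.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_counts(seq, ks=(3, 4, 5)):
    """Count every overlapping substring for each length in ks."""
    counts = Counter()
    for k in ks:
        counts.update(kmers(seq, k))
    return counts


def vectorize(seqs, ks=(3, 4, 5)):
    """Build a shared vocabulary and one count vector per sequence."""
    per_seq = [kmer_counts(s, ks) for s in seqs]
    # Iterating a Counter yields its keys, so this is the set of all substrings seen.
    vocab = sorted(set().union(*per_seq))
    # A Counter returns 0 for substrings that a sequence does not contain.
    rows = [[c[term] for term in vocab] for c in per_seq]
    return vocab, rows


# Sample rows from the question; a real dataset would be much larger.
seqs = ["ATTGCCCGCGCGCCG", "AGGCGCGCAGCAGCA", "GCGCGCAGCAGGACA"]
vocab, X = vectorize(seqs)
```

Each row of `X` has one column per distinct substring in `vocab`, so the rows can be passed to any classifier that accepts count features.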

One Answer

For sequence data, a common default model is an LSTM. It can model long sequences and has much greater representational power than linear models. Take a look at PyTorch's tutorial if you're new to it:

https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html#sphx-glr-beginner-nlp-sequence-models-tutorial-py
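
An LSTM consumes a sequence of indices (or vectors) rather than raw text, so the DNA strings need to be encoded first. A minimal sketch of that preprocessing step, assuming a four-letter vocabulary and the class labels 1 and 2 from the question; `BASE_TO_IX`, `LABEL_TO_TARGET`, `encode` and `encode_batch` are hypothetical names chosen for this example:

```python
# Assumed vocabulary: the four bases from the question, one index each.
BASE_TO_IX = {"A": 0, "C": 1, "G": 2, "T": 3}

# Assumed label mapping: the sample data uses classes 1 and 2, while
# binary losses usually expect targets 0 and 1.
LABEL_TO_TARGET = {1: 0, 2: 1}


def encode(seq):
    """Map a DNA string to a list of integer indices, one per base."""
    return [BASE_TO_IX[base] for base in seq]


def encode_batch(seqs):
    """Encode several equal-length strings into a rectangular list of lists."""
    encoded = [encode(s) for s in seqs]
    if len({len(e) for e in encoded}) > 1:
        raise ValueError("sequences must all have the same length")
    return encoded


# Sample rows from the question.
seqs = ["ATTGCCCGCGCGCCG", "AGGCGCGCAGCAGCA", "GCGCGCAGCAGGACA"]
labels = [1, 2, 1]

inputs = encode_batch(seqs)                      # shape: (3 sequences, 15 bases)
targets = [LABEL_TO_TARGET[y] for y in labels]   # 0/1 targets
```

In PyTorch, `inputs` could then be wrapped in `torch.tensor(inputs, dtype=torch.long)` and fed through an `nn.Embedding` layer followed by `nn.LSTM`, with a final linear layer producing the class score. Because the strings have equal length, no padding is needed.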

If I have a large enough dataset, I usually don't bother to remove the duplicates.
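
Before deciding either way, it can help to check how many strings are repeated and whether any repeated string appears with both classes. A rough sketch, where `summarize_duplicates` is a hypothetical helper and the rows are toy data invented for illustration:

```python
from collections import defaultdict


def summarize_duplicates(pairs):
    """Return (repeated, conflicting) for a list of (sequence, label) pairs.

    repeated maps each sequence seen more than once to its number of occurrences;
    conflicting lists the sequences that appear with more than one label.
    """
    labels_by_seq = defaultdict(list)
    for seq, label in pairs:
        labels_by_seq[seq].append(label)
    repeated = {s: len(ls) for s, ls in labels_by_seq.items() if len(ls) > 1}
    conflicting = sorted(s for s, ls in labels_by_seq.items() if len(set(ls)) > 1)
    return repeated, conflicting


# Toy rows (hypothetical, not from the real dataset).
rows = [
    ("ATTGC", 1),
    ("ATTGC", 1),
    ("GCGCA", 2),
    ("GCGCA", 1),
    ("AGGCG", 2),
]
repeated, conflicting = summarize_duplicates(rows)
```

Here `repeated` reports two strings that occur twice, and `conflicting` flags the one whose copies disagree on the class.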

Answered by yiping on December 14, 2020
