How to tell with a script whether an ID corresponds to a nucleotide or a protein sequence

Bioinformatics Asked by Mauri1313 on April 27, 2021

I have a text file that contains a list of IDs (314 sequences):

AVP78031.1
AVP78042.1
ATO98108.1
ATO98120.1
ATO98132.1
...

My goal is to write a script (maybe in Python or Perl) that checks, for every ID in the list, whether it is a nucleotide or a protein sequence.
For example:

AVP78031.1 -> protein (this is actually a nucleotide sequence; I labelled it as protein only to illustrate the output).
AVP78042.1 -> nucleotide
ATO98108.1 -> nucleotide
ATO98120.1 -> nucleotide
ATO98132.1 -> nucleotide

Any ideas on how to write such a script?

Thanks, everybody!

One Answer

If these are all GenBank or RefSeq accessions, you can use Entrez Direct for this as shown below:

$ cat accs.txt 
ATO98108.1
ATO98120.1
ATO98132.1
AVP78031.1
AVP78042.1
$ cat accs.txt | epost -db nuccore -format acc | efetch -format acc
## no output because none of them are nucleotide accessions
$ cat accs.txt | epost -db protein -format acc | efetch -format acc
AVP78042.1
AVP78031.1
ATO98132.1
ATO98120.1
ATO98108.1

NOTE: This works only if the accessions are currently live, because epost does not find suppressed accessions. For example:

$ cat accs.txt 
NM_002826.3
NM_002826.4
NM_002826.5
$ cat accs.txt | epost -db nuccore -format acc | efetch -format acc
NM_002826.5

Here, all three accessions are valid nucleotide accessions, but only the latest version, NM_002826.5, is live.
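
To get the per-ID labels asked for in the question, the results of the two queries can be merged. The following is a minimal Python sketch, assuming the accessions returned by the nuccore and protein runs have already been collected into lists; the function name `label_accessions` and the "not found" label are illustrative choices, not part of Entrez Direct. IDs returned by neither query (for example, suppressed versions) are reported as not found rather than guessed.

```python
def label_accessions(ids, nucleotide_hits, protein_hits):
    """Label each ID using the accession lists returned by the two queries."""
    nucleotide_hits = set(nucleotide_hits)
    protein_hits = set(protein_hits)
    labels = {}
    for acc in ids:
        if acc in nucleotide_hits:
            labels[acc] = "nucleotide"
        elif acc in protein_hits:
            labels[acc] = "protein"
        else:
            # Returned by neither database: possibly suppressed or not an accession
            labels[acc] = "not found"
    return labels


if __name__ == "__main__":
    # Example values taken from the outputs shown above
    ids = ["AVP78031.1", "NM_002826.3", "NM_002826.5"]
    nucleotide_hits = ["NM_002826.5"]
    protein_hits = ["AVP78031.1"]
    for acc, label in label_accessions(ids, nucleotide_hits, protein_hits).items():
        print(acc, "->", label)
```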

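An alternative is to use the accession prefix formats documented by NCBI and write an appropriate regular expression for them. This works offline and also classifies suppressed accessions, since it looks only at the format of the ID.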
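
A rough Python sketch of the regular-expression approach is given here. It assumes the commonly documented accession formats (GenBank-style protein: 3 letters plus 5 or 7 digits; GenBank-style nucleotide: 1 letter plus 5 digits, or 2 letters plus 6 or 8 digits; RefSeq: a two-letter prefix followed by an underscore). The prefix sets and patterns are a simplified subset and should be checked against the NCBI prefix list before being relied on; the name `classify` is only illustrative.

```python
import re

# RefSeq prefixes (assumed subset, not exhaustive)
REFSEQ_NUCLEOTIDE = {"AC", "NC", "NG", "NM", "NR", "NT", "NW", "NZ", "XM", "XR"}
REFSEQ_PROTEIN = {"AP", "NP", "WP", "XP", "YP"}

# RefSeq: two letters, underscore, identifier, optional version
REFSEQ = re.compile(r"^([A-Z]{2})_[A-Z0-9]+(\.\d+)?$")
# GenBank-style protein: 3 letters + 5 or 7 digits
GENBANK_PROTEIN = re.compile(r"^[A-Z]{3}(\d{5}|\d{7})(\.\d+)?$")
# GenBank-style nucleotide: 1 letter + 5 digits, or 2 letters + 6 or 8 digits
GENBANK_NUCLEOTIDE = re.compile(r"^([A-Z]\d{5}|[A-Z]{2}(\d{6}|\d{8}))(\.\d+)?$")
# WGS-style nucleotide (approximate): 4 or 6 letters followed by at least 8 digits
WGS = re.compile(r"^([A-Z]{4}|[A-Z]{6})\d{8,}(\.\d+)?$")


def classify(accession):
    """Guess 'nucleotide', 'protein' or 'unknown' from the accession format."""
    acc = accession.strip().upper()
    match = REFSEQ.match(acc)
    if match:
        prefix = match.group(1)
        if prefix in REFSEQ_NUCLEOTIDE:
            return "nucleotide"
        if prefix in REFSEQ_PROTEIN:
            return "protein"
        return "unknown"
    if GENBANK_PROTEIN.match(acc):
        return "protein"
    if GENBANK_NUCLEOTIDE.match(acc) or WGS.match(acc):
        return "nucleotide"
    return "unknown"


if __name__ == "__main__":
    for acc in ["AVP78031.1", "ATO98108.1", "NM_002826.3", "NM_002826.5"]:
        print(acc, "->", classify(acc))
```

To run it over the whole list, read the ID file line by line and call `classify` on each non-empty line. Because this checks only the format, it cannot tell whether an accession actually exists; the Entrez Direct queries remain the way to confirm that.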

Correct answer by vkkodali on April 27, 2021
