How to tell with a script whether an ID corresponds to a nucleotide or a protein sequence

Question

I have a text file that contains a list of IDs (314 sequences):
AVP78031.1
AVP78042.1
ATO98108.1
ATO98120.1
ATO98132.1
...

My goal is to write a script (perhaps in Python or Perl) that checks, for each ID in the list, whether it refers to a nucleotide or a protein sequence.
For example:
AVP78031.1 -> protein (this is actually a nucleotide sequence; I changed it to protein here just to show an example).
AVP78042.1 -> nucleotide
ATO98108.1 -> nucleotide
ATO98120.1 -> nucleotide
ATO98132.1 -> nucleotide

Any ideas on how to write such a script?
Thanks, everybody!

vkkodali · Accepted Answer

If these are all GenBank or RefSeq accessions, you can use Entrez Direct for this as shown below:
$ cat accs.txt 
ATO98108.1
ATO98120.1
ATO98132.1
AVP78031.1
AVP78042.1
$ cat accs.txt | epost -db nuccore -format acc | efetch -format acc
## no output because none of them are nucleotide accessions
$ cat accs.txt | epost -db protein -format acc | efetch -format acc
AVP78042.1
AVP78031.1
ATO98132.1
ATO98120.1
ATO98108.1

NOTE: This will work only if the accessions are currently live because epost does not find any suppressed accessions. For example:
$ cat accs.txt 
NM_002826.3
NM_002826.4
NM_002826.5
$ cat accs.txt | epost -db nuccore -format acc | efetch -format acc
NM_002826.5

Here, all three accessions are valid nucleotide accessions, but only the last one, NM_002826.5, is currently live.
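To turn the two Entrez Direct queries into the per-ID labels asked for in the question, the results can be combined in a few lines of Python. This is only a sketch: it assumes you have saved the output of the nuccore query and the protein query as two lists of accessions, and the function name `label_accessions` and the "unresolved" label are made up for illustration. Anything returned by neither query is reported as unresolved rather than guessed, since it may be a suppressed or replaced accession.

```python
def label_accessions(queried, nucleotide_hits, protein_hits):
    """Label each queried accession using the results of the two lookups.

    queried         -- accessions from the input file, in their original order
    nucleotide_hits -- accessions returned by the nuccore query
    protein_hits    -- accessions returned by the protein query
    (All three are assumed to be plain lists of strings such as "AVP78031.1".)
    """
    nucleotide = set(nucleotide_hits)
    protein = set(protein_hits)
    labels = {}
    for acc in queried:
        if acc in nucleotide:
            labels[acc] = "nucleotide"
        elif acc in protein:
            labels[acc] = "protein"
        else:
            # Not live in either database (e.g. suppressed or replaced).
            labels[acc] = "unresolved"
    return labels
```

The unresolved IDs are the ones worth checking by hand or with a prefix-based rule.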
An alternative is to use the published GenBank and RefSeq accession prefix rules and write an appropriate regular expression to classify each ID.
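A rough sketch of the prefix-based approach in Python is shown here. It covers only a simplified subset of the accession formats: RefSeq prefixes before the underscore, GenBank-style proteins as 3 letters plus 5 or 7 digits, and GenBank-style nucleotides as 1 letter plus 5 digits or 2 letters plus 6 or 8 digits. WGS and other special formats are deliberately left out and come back as "unknown". The name `classify` and the exact patterns are my own choices, so check them against the official prefix list before relying on them.

```python
import re

# RefSeq prefixes (the part before the underscore); a subset, not exhaustive.
REFSEQ_NUCLEOTIDE = {"AC", "NC", "NG", "NT", "NW", "NZ", "NM", "NR", "XM", "XR"}
REFSEQ_PROTEIN = {"AP", "NP", "YP", "XP", "WP"}

# GenBank-style protein: 3 letters followed by 5 or 7 digits.
GENBANK_PROTEIN = re.compile(r"^[A-Z]{3}(\d{5}|\d{7})$")
# GenBank-style nucleotide: 1 letter + 5 digits, or 2 letters + 6 or 8 digits.
GENBANK_NUCLEOTIDE = re.compile(r"^([A-Z]\d{5}|[A-Z]{2}(\d{6}|\d{8}))$")


def classify(accession):
    """Return "nucleotide", "protein", or "unknown" based on the format alone."""
    # Drop surrounding whitespace and the ".version" suffix, if any.
    base = accession.strip().upper().split(".", 1)[0]
    if "_" in base:
        prefix = base.split("_", 1)[0]
        if prefix in REFSEQ_NUCLEOTIDE:
            return "nucleotide"
        if prefix in REFSEQ_PROTEIN:
            return "protein"
        return "unknown"
    if GENBANK_PROTEIN.match(base):
        return "protein"
    if GENBANK_NUCLEOTIDE.match(base):
        return "nucleotide"
    return "unknown"
```

Unlike the Entrez Direct lookup, this needs no network access and also works for accessions that are no longer live, but it only inspects the format of the ID and cannot tell you whether the record actually exists.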
