AnswerBun.com

How to tell with a script whether an ID corresponds to a nucleotide or a protein sequence

Bioinformatics Asked by Mauri1313 on April 27, 2021

I have a text file that contains a list of IDs (314 sequences):

AVP78031.1
AVP78042.1
ATO98108.1
ATO98120.1
ATO98132.1
...

My goal is to write a script (maybe in Python or Perl) that checks, for each ID in the list, whether it is a nucleotide or a protein sequence.
For example:

AVP78031.1 -> protein (this is actually a nucleotide sequence; I wrote "protein" here only to show what the output could look like).
AVP78042.1 -> nucleotide
ATO98108.1 -> nucleotide
ATO98120.1 -> nucleotide
ATO98132.1 -> nucleotide

Any ideas on how to write such a script?

Thanks, everybody!

One Answer

If these are all GenBank or RefSeq accessions, you can use Entrez Direct for this as shown below:

$ cat accs.txt 
ATO98108.1
ATO98120.1
ATO98132.1
AVP78031.1
AVP78042.1
$ cat accs.txt | epost -db nuccore -format acc | efetch -format acc
## no output because none of them are nucleotide accessions
$ cat accs.txt | epost -db protein -format acc | efetch -format acc
AVP78042.1
AVP78031.1
ATO98132.1
ATO98120.1
ATO98108.1

NOTE: This works only if the accessions are currently live, because epost does not find suppressed or replaced accessions. For example:

$ cat accs.txt 
NM_002826.3
NM_002826.4
NM_002826.5
$ cat accs.txt | epost -db nuccore -format acc | efetch -format acc
NM_002826.5

Here, all three accessions are valid nucleotide accessions, but only the latest version, NM_002826.5, is live.

An alternative is to use the accession prefix formats documented by NCBI and write regular expressions that match them.
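
As a rough illustration of the prefix-based approach, here is a minimal Python sketch. The patterns are assumptions based on the commonly documented accession formats (RefSeq prefixes such as NM_ or NP_, GenBank proteins as 3 letters plus 5 or 7 digits, GenBank nucleotides as 1 letter plus 5 digits or 2 letters plus 6 or 8 digits) and should be checked against NCBI's current prefix table. The names classify_accession and PATTERNS are hypothetical. The check is purely syntactic: it does not contact NCBI, so it cannot tell whether an accession is live.

```python
import re

# Assumed accession formats (verify against NCBI's prefix table before relying on them):
#   RefSeq:             two-letter prefix + underscore; the prefix encodes the molecule type
#   GenBank protein:    3 letters + 5 or 7 digits
#   GenBank nucleotide: 1 letter + 5 digits, or 2 letters + 6 or 8 digits
# WGS/TSA-style accessions (e.g. 4 letters + 8 or more digits) are not covered here.
VERSION = r"(\.\d+)?"  # optional ".N" version suffix
PATTERNS = [
    ("protein", re.compile(r"(AP|NP|XP|YP|WP)_\d+" + VERSION)),
    ("nucleotide", re.compile(r"(AC|NC|NG|NM|NR|NT|NW|NZ|XM|XR)_[A-Z0-9]+" + VERSION)),
    ("protein", re.compile(r"[A-Z]{3}(\d{5}|\d{7})" + VERSION)),
    ("nucleotide", re.compile(r"([A-Z]\d{5}|[A-Z]{2}(\d{6}|\d{8}))" + VERSION)),
]


def classify_accession(accession):
    """Return 'protein', 'nucleotide', or 'unknown' based on the accession format alone."""
    acc = accession.strip().upper()
    for label, pattern in PATTERNS:
        if pattern.fullmatch(acc):
            return label
    return "unknown"


# Example with IDs taken from the question and the answer above.
for acc in ["AVP78031.1", "ATO98108.1", "NM_002826.5"]:
    print(acc, "->", classify_accession(acc))
```

With these patterns, the five IDs from the question are all classified as protein, which agrees with the Entrez Direct result above. To process a whole file, the same function can be applied to each line, and anything reported as "unknown" can then be checked by hand or with Entrez Direct.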

Correct answer by vkkodali on April 27, 2021
