I have a text file that contains a list of IDs (314 sequences):
AVP78031.1 AVP78042.1 ATO98108.1 ATO98120.1 ATO98132.1 ...
My goal is to write a script (in Python or Perl, for example) that checks whether each ID in the list is a nucleotide or a protein sequence, for example:
AVP78031.1 -> protein (this is actually a nucleotide sequence; I changed it to protein just to show an example)
AVP78042.1 -> nucleotide
ATO98108.1 -> nucleotide
ATO98120.1 -> nucleotide
ATO98132.1 -> nucleotide
Any ideas on how to write such a script?
If these are all GenBank or RefSeq accessions, you can use Entrez Direct for this as shown below:
$ cat accs.txt
ATO98108.1
ATO98120.1
ATO98132.1
AVP78031.1
AVP78042.1

$ cat accs.txt | epost -db nuccore | efetch -format acc
## no output because none of them are nucleotide accessions

$ cat accs.txt | epost -db protein -format acc | efetch -format acc
AVP78042.1
AVP78031.1
ATO98132.1
ATO98120.1
ATO98108.1
NOTE: This works only if the accessions are currently live, because epost does not find suppressed accessions. For example:
$ cat accs.txt
NM_002826.3
NM_002826.4
NM_002826.5

$ cat accs.txt | epost -db nuccore -format acc | efetch -format acc
NM_002826.5
Here, all three accessions are valid nucleotide accessions, but only the last one, NM_002826.5, is live.
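Since the question asks for a script, the two Entrez Direct queries above can be wrapped in Python. This is only a rough sketch: the function names `fetch_live_accessions` and `classify` are made up for illustration, and it assumes `epost` and `efetch` are installed and on the PATH. Anything that neither database returns (a suppressed accession, for instance) is labeled "not found" rather than guessed.

```python
import subprocess

ALLOWED_DBS = ("protein", "nuccore")


def fetch_live_accessions(accessions, db):
    """Return the set of accessions that Entrez Direct resolves in `db`.

    Runs the same epost | efetch pipeline as the shell commands above.
    Needs network access and the Entrez Direct tools on PATH.
    """
    if db not in ALLOWED_DBS:
        raise ValueError(f"db must be one of {ALLOWED_DBS}, got {db!r}")
    pipeline = f"epost -db {db} -format acc | efetch -format acc"
    result = subprocess.run(
        pipeline,
        shell=True,  # db is validated above, so the command string is fixed
        input="\n".join(accessions) + "\n",
        capture_output=True,
        text=True,
    )
    return {line.strip() for line in result.stdout.splitlines() if line.strip()}


def classify(accessions, protein_hits, nucleotide_hits):
    """Label each accession by the database that returned it."""
    labels = {}
    for acc in accessions:
        if acc in protein_hits:
            labels[acc] = "protein"
        elif acc in nucleotide_hits:
            labels[acc] = "nucleotide"
        else:
            # Not live in either database, e.g. suppressed or mistyped
            labels[acc] = "not found"
    return labels
```

To use it, read the IDs from the file (for example `ids = open("accs.txt").read().split()`), call `fetch_live_accessions(ids, "protein")` and `fetch_live_accessions(ids, "nuccore")`, and pass both results to `classify`.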
An alternative is to use the documented accession prefixes and come up with an appropriate regular expression.
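A minimal sketch of the regular-expression approach in Python is given here. The patterns are my own simplification of the accession formats as I understand them (GenBank proteins as 3 letters plus 5 or 7 digits, GenBank nucleotides as 1 letter plus 5 digits or 2 letters plus 6 or 8 digits, RefSeq records with a two-letter prefix and an underscore). They are not exhaustive, since WGS and several other accession types are left out, so please check them against the official prefix table before relying on the result. The function name `guess_type` is hypothetical.

```python
import re

# Assumed, simplified patterns; not an exhaustive list of accession formats.
REFSEQ_PROTEIN = re.compile(r"^(AP|NP|WP|XP|YP)_\d+(\.\d+)?$")
REFSEQ_NUCLEOTIDE = re.compile(
    r"^(AC|NC|NG|NM|NR|NT|NW|NZ|XM|XR)_[A-Z]*\d+(\.\d+)?$"
)
GENBANK_PROTEIN = re.compile(r"^[A-Z]{3}(\d{5}|\d{7})(\.\d+)?$")
GENBANK_NUCLEOTIDE = re.compile(
    r"^([A-Z]\d{5}|[A-Z]{2}(\d{6}|\d{8}))(\.\d+)?$"
)


def guess_type(accession):
    """Guess 'protein', 'nucleotide', or 'unknown' from the accession format."""
    acc = accession.strip().upper()
    if REFSEQ_PROTEIN.match(acc) or GENBANK_PROTEIN.match(acc):
        return "protein"
    if REFSEQ_NUCLEOTIDE.match(acc) or GENBANK_NUCLEOTIDE.match(acc):
        return "nucleotide"
    # Format not covered by the simplified patterns above
    return "unknown"


if __name__ == "__main__":
    for acc in ["AVP78031.1", "ATO98108.1", "NM_002826.5"]:
        print(acc, "->", guess_type(acc))
```

Unlike the Entrez Direct route, this needs no network access and also labels suppressed accessions, but it only checks the shape of the ID and cannot tell whether the record actually exists.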
Correct answer by vkkodali on April 27, 2021