TransWikia.com

Get protein names corresponding to PDB ID

Bioinformatics Asked on November 13, 2020

I have a list of about 4000 PDB IDs and would like to get the actual names of the proteins (e.g. lactate dehydrogenase, cytochrome c). I tried the batch header section at the Protein Databank Download page but it refused to accept my PDB IDs in formats (xxxx or xxxx.pdb, individually or space-separated) that worked in an interactive search for the protein structure.

Any suggestions?

2 Answers

You can use one of the UniProt Protein APIs.

Assuming your PDB entries are stored one per line in a text file, for example an example.txt containing:

1brr
4lzm
2dyi
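The question mentions IDs in mixed forms (xxxx or xxxx.pdb, space-separated or one per line). A minimal sketch for turning such a list into the one-ID-per-line file used here could look like the following. The function name normalize_pdb_ids is hypothetical, and the assumption that every ID is a classic four-character PDB ID starting with a digit is mine, not the original answer's.

```shell
#!/usr/bin/env bash
# normalize_pdb_ids: read raw PDB IDs on stdin, print one clean ID per line.
# Hypothetical helper for illustration; not part of the original answer.
normalize_pdb_ids() {
  tr -s ' \t,' '\n' |              # put space/tab/comma-separated IDs on separate lines
  tr -d '\r' |                     # drop Windows carriage returns
  tr '[:upper:]' '[:lower:]' |     # lower-case, matching the IDs in example.txt
  sed 's/\.pdb$//' |               # strip a trailing .pdb extension
  grep -E '^[0-9][a-z0-9]{3}$' |   # assumption: keep only classic 4-character IDs
  awk '!seen[$0]++'                # remove duplicates, keeping the first occurrence
}
```

It could be used as normalize_pdb_ids < raw_ids.txt > example.txt, where raw_ids.txt stands for whatever file currently holds the list.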

On the command line, a small script like this downloads the name for each PDB entry, if one is available:

# Query the API for each ID and prefix every returned name with "<ID><tab>"
while read -r line;
do
 curl -s -X GET --header 'Accept:application/json' "https://www.ebi.ac.uk/proteins/api/proteins/pdb:$line" |
 jq -r '.[].protein.recommendedName.fullName.value' |
 sed "s/^/$line\t/" >> pdb_names.txt;
done < example.txt;

You need to have curl, sed and jq installed on your system.

This gives you the following output in pdb_names.txt:

1brr    Bacteriorhodopsin
4lzm    Endolysin
2dyi    Ribosome maturation factor RimM
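Since a name is only written when one is available, it can help to check afterwards which IDs ended up without one. The sketch below assumes the tab-separated output format shown above, and relies on jq -r printing the literal text null when the requested field is absent. The function name missing_names is hypothetical.

```shell
#!/usr/bin/env bash
# missing_names IDS_FILE NAMES_FILE
# Print every ID from IDS_FILE that has no usable name in NAMES_FILE.
# Hypothetical helper for illustration; not part of the original answer.
missing_names() {
  awk -F'\t' '
    # First file (names): remember IDs whose name is neither empty nor "null".
    FILENAME == ARGV[1] { if ($2 != "" && $2 != "null") named[$1] = 1; next }
    # Second file (IDs): report those that were never marked as named.
    !($1 in named) { print $1 }
  ' "$2" "$1"
}
```

For example, missing_names example.txt pdb_names.txt > unnamed_ids.txt collects the IDs that may need a manual look (unnamed_ids.txt is just an illustrative file name).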

Update

If you want to speed it up, you can run it with GNU parallel.

parallel -j 4 'curl -s -X GET --header "Accept:application/json" "https://www.ebi.ac.uk/proteins/api/proteins/pdb:{}" | jq -r ".[] | .protein.recommendedName.fullName.value" | sed "s/^/{}\t/" >> pdb_names_parallel.txt' :::: example.txt

The -j option sets how many jobs run in parallel. The limit of the UniProt API is 200 requests per second per user.


Update 7. Nov 2020

To get other information besides the protein name, you need to know what the JSON response from UniProt looks like.
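One way to explore the response is to list every leaf path it contains with jq. The sketch below runs on a hand-written, heavily abridged sample that contains only the two fields used in this answer; the real response has many more fields, and the function name list_json_paths is hypothetical.

```shell
#!/usr/bin/env bash
# Abridged, hand-written sample: only the fields referenced by the jq filters
# in this answer. The real API response is much larger.
sample='[{"protein":{"recommendedName":{"fullName":{"value":"Endolysin"}}},"organism":{"names":[{"value":"Enterobacteria phage T4"}]}}]'

# list_json_paths: print each leaf path of the JSON on stdin in dotted form,
# with array indices shown as numbers.
list_json_paths() {
  jq -r 'paths(scalars) | map(tostring) | join(".")'
}

printf '%s\n' "$sample" | list_json_paths
```

The same function can be fed a live response by piping the curl output for a single PDB ID into list_json_paths, which makes it easier to spot the field you want before writing the jq filter.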

To also get the scientific name, you can run the following command:

parallel -j 4 'curl -s -X GET --header "Accept:application/json" "https://www.ebi.ac.uk/proteins/api/proteins/pdb:{}" | jq -r ".[] | .protein.recommendedName.fullName.value + \" - \" + .organism.names[0].value" | sed "s/^/{}\t/" >> pdb_names_parallel.txt' :::: example.txt

As a result you get this:

1brr    Bacteriorhodopsin - Halobacterium salinarum (strain ATCC 700922 / JCM 11081 / NRC-1)
4lzm    Endolysin - Enterobacteria phage T4
2dyi    Ribosome maturation factor RimM - Thermus thermophilus (strain HB8 / ATCC 27634 / DSM 579)

Correct answer by Mr_Z on November 13, 2020

Assuming you can use R, have you tried with biomaRt? For example, using 2bhl (my PhD lover :D)

library(biomaRt)
ensembl <- useMart("ensembl",dataset="hsapiens_gene_ensembl")

# get list of all available info
filters <- listFilters(ensembl)
attributes <- listAttributes(ensembl)

getBM(attributes=c('hgnc_symbol','ensembl_gene_id','entrezgene_id',
                'protein_id','description',"superfamily"), 
      filters = 'pdb', 
      values = "2bhl", 
      mart = ensembl)

Returns

hgnc_symbol ensembl_gene_id entrezgene_id protein_id description
1         G6PD ENSG00000160211          2539   ADO22353 glucose-6-phosphate dehydrogenase [Source:HGNC Symbol;Acc:HGNC:4057]
2         G6PD ENSG00000160211          2539   CAA27309 glucose-6-phosphate dehydrogenase [Source:HGNC Symbol;Acc:HGNC:4057]
3         G6PD ENSG00000160211          2539   AAA63175 glucose-6-phosphate dehydrogenase [Source:HGNC Symbol;Acc:HGNC:4057]
4         G6PD ENSG00000160211          2539   AAA52500 glucose-6-phosphate dehydrogenase [Source:HGNC Symbol;Acc:HGNC:4057]
5         G6PD ENSG00000160211          2539   AAA52501 glucose-6-phosphate dehydrogenase [Source:HGNC Symbol;Acc:HGNC:4057]
6         G6PD ENSG00000160211          2539   CAA39089 glucose-6-phosphate dehydrogenase [Source:HGNC Symbol;Acc:HGNC:4057]
7 ...

I am sure if you look at all the available filters and attributes, you can pinpoint the ID you are looking for.

Answered by fra on November 13, 2020
