TransWikia.com

Extract residue sequence from pdb file (in biopython, but open to recommendations)

Bioinformatics Asked by mzzx on April 11, 2021

I’m new to Biopython and I’d like to extract the sequence of residues from a pdb file.

My two questions are:

  1. What is the simplest way to do this? (Esp. when there is more than one sequence)
    and
  2. Should I be worried about "PDBConstructionWarning: WARNING: Chain B is discontinuous"?

So far, I’ve obtained the residue sequence via:

p = PDBParser()
structure = p.get_structure("1ppi", "1ppi.pdb")
ppb=PPBuilder()
for pp in ppb.build_peptides(structure):
     print(pp.get_sequence())

seq = pp.get_sequence().__str__()

This seems to work well for this molecule. However, is there an easier way, especially when there is more than one sequence?

For example, I’ve read that one can also do

res_list = Bio.PDB.Selection.unfold_entities(structure, 'R')

but res_list is not a sequence of residues in str, and I don’t know how to convert the output from res_list to a str sequence.

In addition (or perhaps because I’m going through the PPBuilder), I’ve recently gotten a lot of warnings of the type:
/usr/local/lib/python3.8/site-packages/Bio/PDB/StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line..
(For example, with structure = p.get_structure("5owu", "5owu.pdb"))

I’ve seen a discussion about how to silence such warnings, but should I be worried about this? I’ve also noticed that when I get these warnings, pp builder seems to give me more sequences then are there.

4 Answers

  1. There are other ways to do it (many and some might be easier), but if you want to do it with BioPython PDB module you can iterate the residues like this:

     # You can use a dict to convert three letter code to one letter code
     d3to1 = {'CYS': 'C', 'ASP': 'D', 'SER': 'S', 'GLN': 'Q', 'LYS': 'K',
     'ILE': 'I', 'PRO': 'P', 'THR': 'T', 'PHE': 'F', 'ASN': 'N', 
     'GLY': 'G', 'HIS': 'H', 'LEU': 'L', 'ARG': 'R', 'TRP': 'W', 
     'ALA': 'A', 'VAL':'V', 'GLU': 'E', 'TYR': 'Y', 'MET': 'M'}
    
    
     # Just an example input pdb
     record = '1pa2.pdb'
    
     # run parser
     parser = PDBParser(QUIET=True)
     structure = parser.get_structure('struct', record)    
    
     # iterate each model, chain, and residue
     # printing out the sequence for each chain
    
     for model in structure:
         for chain in model:
             seq = []
             for residue in chain:
                 seq.append(d3to1[residue.resname])
             print('>some_headern',''.join(seq))
    
  2. Not really, is very common, you can turn it off like this (taken from https://lists.open-bio.org/pipermail/biopython/2014-July/015371.html):

     import warnings
     from Bio import PDBConstructionWarning
     #your code
     with warnings.catch_warnings():
     warnings.simplefilter('ignore', PDBConstructionWarning)
     #your code which might trigger the warning
     #rest of your code here
    
     However, because this is so common, you can just use the
     QUIET=True option on the PDBParser object:
    
     from Bio.PDB.PDBParser import PDBParser
     struct = PDBParser(QUIET=True).get_structure('tmp', pdbf)
    

Try help(PDBParser) for more, or see: http://biopython.org/DIST/docs/api/Bio.PDB.PDBParser%27.PDBParser-class.html#__init__

Correct answer by Arap on April 11, 2021

The discontinuity is normal: it is a missing stretch due to lack of density. In PyMOL you see them as dotted lines.

I am sorry to say, the easiest/safest way to get the sequences is not using the PDB files...

import requests
data = requests.get(f'https://www.ebi.ac.uk/pdbe/api/pdb/entry/molecules/{code}').json()[code.lower()]
print(data[0]['sequence'])

One gets back a lot of information: example. This will tell you when you have multiple copies of the same peptide. The only information you do not get is what the PDB numbering is relative to the canonical Uniprot for that chain, which has to be taken from SIFTS.

A side note... Biopython.PDB is one of the best BioPython packages, but I personally I much prefer PyMOL as a python module (pymol2 module) to Biopython.PDB as it is functionally more complete and straightforward even if everything operates in a stage. You might want to check it out.

with pymol2.PyMOL() as pymol:
    pymol.cmd.fetch('1UBQ', 'prot')
    print(pymol.cmd.get_fastastr('prot'))

Answered by Matteo Ferla on April 11, 2021

you can try PDBx Python Parser too:

http://mmcif.wwpdb.org/docs/sw-examples/python/html/index.html

it has an example showing exactly what you need

http://mmcif.wwpdb.org/docs/sw-examples/python/html/fasta.html

Answered by pippo1980 on April 11, 2021

You can also use the SeqIO module from Biopython https://biopython.org/docs/1.75/api/Bio.SeqIO.PdbIO.html

and do something like this:

from Bio import SeqIO

for record in SeqIO.parse(target, "pdb-atom"):
    print(record.seq)

where target is the path to your pdb file the parser has multiple input/output functions, and pdb-atom is the one that will read out of the coordinates itself.

Answered by cminima on April 11, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP