[Biopython] Module Polypeptide

Wed Jun 10 16:29:26 EDT 2009

On Wed, Jun 10, 2009 at 6:40 PM, stanam bharat<bharat.s007 at gmail.com> wrote:
> Hi Peter,
>
> Yes, I want only the amino acid sequence with respective chain IDs.

In that case there is a much easier way: go to www.pdb.org, find your
structure, and use the links on the left to download the PDB entry
sequence as a FASTA file. In this case, the URL is:

http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=FASTA&compression=NO&structureId=3FCS
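
Once the FASTA file is saved locally, reading it takes only a few lines
of Python. The sketch below is an illustration, not part of Bio.PDB:
read_fasta is a hypothetical helper and the example records are made up.
It returns each record's header line and sequence. The chain ID should
appear somewhere in the header, but check the exact layout in the file
you actually download.

```python
def read_fasta(lines):
    """Return a list of (header, sequence) pairs from FASTA-formatted lines."""
    records = []
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; store the previous one, if any
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence data may be wrapped over several lines
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up example data; for a real file, pass an open file handle instead
example = [
    ">chain one (made-up header)",
    "MKV",
    "LAG",
    ">chain two (made-up header)",
    "GSH",
]
for header, sequence in read_fasta(example):
    print(header, sequence)
```

Biopython's own Bio.SeqIO.parse(handle, "fasta") does this job too; the
hand-written version is only there to show how little is involved.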

> Your code works really fine. How did you write it.I mean that,
> I could not find these small basic functions like chain.id ,
> to_one_letter_code.get(res.resname,"X") in the cookbook or
> http://www.biopython.org/DIST/docs/api/Bio.PDB.Polypeptide-module.html (as I
> remember!!)

Some of this (like chain.id - that should be in the documentation?)
was just memory from having worked with the PDB parser a
couple of years ago, and I recall finding the Bio.PDB code was
quite difficult for me initially - but I learnt from it.

The to_one_letter_code thing is just a Python dictionary used in
Bio.PDB.Polypeptide. I remembered it was in Bio.PDB somewhere,
and on this occasion I found it just by reading the Bio.PDB source
code (always worth trying if the documentation for any Python code
is missing). This may not be in the documentation - I'm not sure
if Thomas intended this as a public API or not.
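
As a rough illustration of that kind of dictionary lookup, here is a
self-contained sketch. The THREE_TO_ONE dictionary and the to_sequence
function are written out by hand for this example (they are stand-ins,
not the Bio.PDB objects) and cover only the 20 standard amino acids.
Anything else, such as a water or a modified residue, falls back to "X"
through dict.get:

```python
# Hand-written stand-in for the dictionary in Bio.PDB.Polypeptide
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def to_sequence(resnames):
    """Join one-letter codes, using "X" for any name not in the dictionary."""
    return "".join(THREE_TO_ONE.get(name.upper(), "X") for name in resnames)


print(to_sequence(["MET", "ALA", "HOH", "GLY"]))  # prints MAXG
```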

A general tip for Python is that you can call help(object) and
dir(object) at the Python prompt. Using help in this way shows
the docstring (which also appears on our API pages online).
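
A small sketch of this kind of exploration, using a plain dictionary so
that it runs without any PDB file. The public_names helper is a
hypothetical convenience for this example, not something from Biopython:

```python
import inspect


def public_names(obj):
    """List the attribute names of obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


# A built-in dictionary is used here so the example runs anywhere
lookup = {"ALA": "A"}
print(public_names(lookup))        # includes 'get', 'keys', 'items', ...
print(inspect.getdoc(lookup.get))  # the docstring text that help() would display
```

The same calls on a chain object from the parser should list its
attributes and methods, which is one way to stumble across things
like chain.id.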

> My another doubt is, when you run your code or my code,
> messages like
>
> WARNING: Chain A is discontinuous at line 26340.
> WARNING: Chain B is discontinuous at line 26378.
> WARNING: Chain C is discontinuous at line 26587.
> WARNING: Chain D is discontinuous at line 26673.
> WARNING: Chain A is discontinuous at line 26802.
> WARNING: Chain B is discontinuous at line 27034.
> WARNING: Chain C is discontinuous at line 27107.
> WARNING: Chain D is discontinuous at line 27377.
>
> These are given by Parser module.

Yes - as I said in an earlier email, you should look at your PDB file to work
out what causes this (which you seem to have solved).

> Which line these messages refer to?

Those should be line numbers in the PDB file. Open the PDB file in a good
text editor, and you should be able to jump to a line number (often under
the Edit menu) to have a look.
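
If you would rather do this from Python than from an editor, a sketch
along these lines should work. The lines_around function is a
hypothetical helper written for this email; it returns the requested
line plus a little context either side, numbered from 1 like the
warning messages:

```python
def lines_around(filename, line_number, context=2):
    """Return (number, text) pairs for the lines near line_number (1-based)."""
    first = max(1, line_number - context)
    last = line_number + context
    selected = []
    with open(filename) as handle:
        for number, line in enumerate(handle, start=1):
            if number > last:
                break  # no need to read the rest of the file
            if number >= first:
                selected.append((number, line.rstrip("\n")))
    return selected
```

For example, lines_around("3FCS.pdb", 26340) would return the lines
surrounding the first warning quoted above.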

> How can I access this information.(REMARK 465 in PDB gives info about
> missing residues.I think there is a relation between these two.).

Bio.PDB concentrates on the atomic information, but does have a basic
header parser:

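from Bio.PDB.PDBParser import PDBParser

# PERMISSIVE=1: problems building the structure give warnings, not errors
p = PDBParser(PERMISSIVE=1)
structure_id = "3FCS"
filename = "3FCS.pdb"
s = p.get_structure(structure_id, filename)
# The parsed header is a plain dictionary
print(s.header.keys())
print(s.header["author"])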

The bad news is most of the REMARK data lines are ignored - parsing
them into a useful data structure would be a pretty complicated job!
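
For the REMARK 465 lines alone, a simple pattern match may be enough.
The sketch below is an assumption-laden illustration, not a Bio.PDB
feature: it assumes each data line holds an optional model number, a
residue name, a one-character chain ID and a residue number, and it
skips the explanatory text lines because they do not fit that pattern.
The function name and the example lines are made up.

```python
import re

# Assumed layout of a data line: "REMARK 465", optional model number,
# residue name, chain ID, residue number with optional insertion code
MISSING = re.compile(
    r"^REMARK 465\s+(?:(\d+)\s+)?([A-Z0-9]{1,3})\s+(\S)\s+(-?\d+)([A-Z]?)\s*$"
)


def missing_residues(lines):
    """Return (chain_id, residue_number, residue_name) per REMARK 465 data line."""
    found = []
    for line in lines:
        match = MISSING.match(line)
        if match:
            model, resname, chain, number, icode = match.groups()
            found.append((chain, int(number), resname))
    return found


# Made-up example lines in the assumed layout
example = [
    "REMARK 465 MISSING RESIDUES",
    "REMARK 465   M RES C SSSEQI",
    "REMARK 465     MET A     1",
    "REMARK 465     GLY B    -2",
    "REMARK 470 SOMETHING ELSE",
]
print(missing_residues(example))
```

Check the result against a few lines of your own file before trusting
it, since older or hand-edited PDB files may not follow this layout.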

Missing residues in the atomic coordinate section could certainly trigger
those warning messages about discontinuities.  Looking at the REMARK
470 lines, some of the residues that are present are missing atoms too.
In other words, getting the sequence out is difficult because your PDB
file is missing data. Normally the polypeptide approach would be fine.

I would expect the header section of the PDB file to include the FULL
amino acid sequence (in the SEQRES lines), but my example code will
skip the missing residues (because they are simply not in the ATOM lines).

You probably want the full amino acid sequence, in which case you can
either manually parse the SEQRES lines (and again, turn the three-letter
codes into one-letter amino acids), or, as I mentioned earlier, just get
the FASTA file from the PDB instead.
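
If you do go the manual route, here is a minimal sketch of the SEQRES
approach. It assumes the standard fixed-column PDB layout (chain ID in
column 12, residue names from column 20 onwards) and that the SEQRES
lines for each chain appear in order. The seqres_sequences function,
the CODES dictionary and the example lines are all made up for this
illustration, and non-standard residue names come out as "X".

```python
# Three-letter to one-letter codes for the 20 standard amino acids
THREE_LETTER = (
    "ALA ARG ASN ASP CYS GLN GLU GLY HIS ILE "
    "LEU LYS MET PHE PRO SER THR TRP TYR VAL"
).split()
ONE_LETTER = "ARNDCQEGHILKMFPSTWYV"
CODES = dict(zip(THREE_LETTER, ONE_LETTER))


def seqres_sequences(lines):
    """Return {chain_id: one-letter sequence} built from SEQRES records."""
    residues = {}
    for line in lines:
        if line.startswith("SEQRES") and len(line) > 19:
            chain_id = line[11]        # column 12 holds the chain ID
            names = line[19:].split()  # residue names start at column 20
            residues.setdefault(chain_id, []).extend(names)
    return {
        chain: "".join(CODES.get(name, "X") for name in names)
        for chain, names in residues.items()
    }


# Made-up example records in the assumed column layout
example = [
    "SEQRES   1 A    5  MET ALA GLY LYS TRP",
    "SEQRES   1 B    3  GLY SER HIS",
]
print(seqres_sequences(example))
```

With a real file you would pass the open file handle, for example
seqres_sequences(open("3FCS.pdb")).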

Peter