[Biopython-dev] Optimization of PDBParser and friends

Thu Sep 6 10:17:03 EDT 2012

On Thu, Sep 6, 2012 at 1:52 AM, João Rodrigues <anaryin at gmail.com> wrote:

>
> What Bio.PDB does right now is rely on the list to iterate over things.
> Thus, you get the order in which you read the PDB file. However, if you
> sort it using the sort methods of the various objects, you get the
> following rules:
>
> Atom.py - N CA C O first, then alphabetically
> Residue.py - First amino acids and nucleic acids, then heteroatoms.
> Chain.py - Empty chains last.
>
> These are already in place somewhere in the code. I just used them to
> overload the __cmp__ method, with a couple of additions because I
> personally disagree with the following:
>
> Atom.py - Inorganic atoms should come out last. For simplicity.
> Residue.py - If the PDB order is 151 MSE, 152 VAL, 153 CYS, you should get
> in return when you iterate: 151, 152, 153. Right now you get 152, 153, 151.
> PDB files already have weird large numbers for water and ions, for example,
> so these come out last anyway. Pushing all HETATMs to the end will
> sometimes disrupt the "natural" order of things, for instance with modified
> residues. Magic perhaps :)
>
>
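To make the two orderings concrete, here is a minimal sketch in plain Python.
It does not use the actual Bio.PDB classes: residue ids are modelled as
(hetero flag, number, insertion code) tuples, and atom_sort_key,
residue_key_current and residue_key_proposed are hypothetical names that only
mirror the rules as described above, not the existing sort code.

```python
# Hypothetical sort keys mirroring the rules quoted above (not Bio.PDB code).
BACKBONE_ORDER = {"N": 0, "CA": 1, "C": 2, "O": 3}


def atom_sort_key(name):
    """Backbone atoms N, CA, C, O first, then everything else alphabetically."""
    return (BACKBONE_ORDER.get(name, len(BACKBONE_ORDER)), name)


def residue_key_current(res_id):
    """Described current rule: standard residues first, hetero residues last."""
    hetfield, resseq, icode = res_id
    return (hetfield != " ", resseq, icode)


def residue_key_proposed(res_id):
    """Proposed rule: order by residue number only, ignoring the hetero flag."""
    hetfield, resseq, icode = res_id
    return (resseq, icode)


atom_names = ["CB", "O", "CA", "SG", "N", "C"]
sorted_atoms = sorted(atom_names, key=atom_sort_key)

# 151 MSE is on HETATM lines; 152 VAL and 153 CYS are standard residues.
res_ids = [("H_MSE", 151, " "), (" ", 152, " "), (" ", 153, " ")]
current_order = [r[1] for r in sorted(res_ids, key=residue_key_current)]
proposed_order = [r[1] for r in sorted(res_ids, key=residue_key_proposed)]
```

With these keys the atoms come out as N, CA, C, O, CB, SG, and the residues as
152, 153, 151 under the described current rule versus 151, 152, 153 under the
proposed one.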
Here's another edge case to think about: 3BEG
<http://www.rcsb.org/pdb/explore/explore.do?structureId=3BEG>.
The enzyme is chain A, starting from residue number 69; the substrate
peptide is chain B; and then after listing the atoms for chain B they jump
back to chain A and add the three ligands as individual residues, with
residue numbers 1, 2 and 3, on HETATM lines.

The current PDBParser complains about this structure but parses it so that
the extra HETATM residues are at the end of chain A's child_list. If I were
to try to generate a polypeptide sequence from each of the chains in this
structure, I think I'd want to just ignore the three extra residues, rather
than have them listed as the first three residues of the sequence, as "SAX".
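
A rough sketch of the "just ignore them" behaviour, assuming the chain is
available as a list of (residue id, residue name) pairs in child_list order.
The chain_sequence helper, the tiny THREE_TO_ONE table and the toy residues
(including the LG1..LG3 ligand names) are made up for illustration and are not
taken from 3BEG or from Bio.PDB:

```python
# Hypothetical helper: build a one-letter sequence, optionally skipping
# every residue whose hetero flag is not blank (i.e. HETATM residues).
THREE_TO_ONE = {"ALA": "A", "GLY": "G", "SER": "S", "VAL": "V"}


def chain_sequence(residues, skip_hetero=True):
    letters = []
    for (hetfield, resseq, icode), resname in residues:
        if skip_hetero and hetfield != " ":
            continue  # ignore ligands, waters and other HETATM residues
        letters.append(THREE_TO_ONE.get(resname, "X"))  # "X" for unknown names
    return "".join(letters)


# Toy chain: protein starts at residue 69, ligands numbered 1-3 come last.
toy_chain = [
    ((" ", 69, " "), "GLY"),
    ((" ", 70, " "), "SER"),
    ((" ", 71, " "), "VAL"),
    (("H_LG1", 1, " "), "LG1"),
    (("H_LG2", 2, " "), "LG2"),
    (("H_LG3", 3, " "), "LG3"),
]

seq_without_hetero = chain_sequence(toy_chain)
seq_with_hetero = chain_sequence(toy_chain, skip_hetero=False)
```

Skipping hetero residues gives "GSV" here, while keeping them gives "GSVXXX".
The catch is that a blanket skip also drops modified residues such as MSE that
really are part of the polypeptide.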

How do you think this should be handled? Maybe treat in-sequence modified
residues differently from out-of-sequence HETATMs?

-E