[Biopython-dev] PDB tidy script

Wed Mar 25 17:44:30 EDT 2009

On Mon, Mar 23, 2009 at 5:05 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:

>
> If you look back over the history, there initially was no header parsing,
> it was a contribution from Kristian Rother, and I would agree, it is rather
> disjoint from the rest of the code.  One thing I personally wanted last
> time I was working with PDB files was to have secondary structure
> information (from the alpha helix and beta sheet lines in the header)
> mapped onto the residue objects automatically.
>
> And yes, Thomas is supporting the PDB module, but his time has
> been rather limited of late.  When I asked him about some of the
> open enhancement requests in bugzilla recently (off list) he said
> we needed "a separate class to parse all the info in the header,
> not a slew of additions to the core parser class (which is designed
> to deal with the 3D data only)."
>
>
I can understand both those wishes. Looking at the features currently
available in the module, the best approach might be to leave the 3D parser
and PDB.Entity-derived classes alone and add another wrapper class
containing the header, sequence (maybe), secondary and tertiary structure as
separate attributes.

When working in the REPL, I've wished for a simpler function to load PDB
files by path and figure out the name automatically; this would be an easy
way to do it without violating Thomas's parser -- just use
parse_pdb_header() in the wrapper, and use the name from there as the first
argument to PDBParser.get_structure(). For example (quick & dirty):

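import os
from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.parse_pdb_header import parse_pdb_header

class PDBLoader:
    def __init__(self, path):
        # Expose the parsed header fields (name, author, ...) as attributes
        self.__dict__.update(parse_pdb_header(path))
        # Fall back to the file's base name if the header gave no name
        if not getattr(self, 'name', None):
            self.name = os.path.basename(path).split('.')[0]
        parse_3d = PDBParser()
        self.structure = parse_3d.get_structure(self.name, path)
        # self.secondary = ?
        # link 1/2/3ary data in various ways ...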

>>> pdb = PDBLoader('a_structure.pdb')
>>> dir(pdb)
['__doc__', '__init__', '__module__', 'author', 'compound',
'deposition_date', 'head', 'journal_reference', 'name', 'release_date',
'resolution', 'source', 'structure', 'structure_method',
'structure_reference']

In that case, it would be reasonable to let get_structure and
parse_pdb_header take an open file-like object as an alternative to the PDB
file's path to avoid opening and closing the same file repeatedly. There's
also some cleanup to do in parse_pdb_header.py alongside this.
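
A rough sketch of the path-or-handle idea, not existing Bio.PDB code: the
helper below (open_or_passthrough is a made-up name) opens a path itself but
passes an already-open handle straight through and leaves it open, so the same
handle can be rewound and read twice. count_records is only a stand-in for the
header pass and the 3D pass; the real parsers would need their own changes.

```python
import io
from contextlib import contextmanager


@contextmanager
def open_or_passthrough(source, mode="r"):
    """Yield a readable handle for a path or an already-open file-like object.

    Hypothetical helper for illustration only. Paths are opened and closed
    here; handles supplied by the caller are left open for the caller.
    """
    if isinstance(source, str):
        handle = open(source, mode)
        try:
            yield handle
        finally:
            handle.close()
    else:
        yield source


def count_records(source):
    """Stand-in for two parsing passes over one handle (header, then atoms)."""
    with open_or_passthrough(source) as handle:
        n_header = sum(1 for line in handle if line.startswith("HEADER"))
        handle.seek(0)  # rewind instead of reopening the file
        n_atoms = sum(1 for line in handle if line.startswith("ATOM"))
    return n_header, n_atoms


# Usage with an in-memory handle; a path string would work the same way.
demo = io.StringIO("HEADER    DEMO\nATOM      1\nATOM      2\nEND\n")
counts = count_records(demo)
```

This assumes the handle is seekable; a non-seekable stream would have to be
buffered first or parsed in a single pass.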

Does this sound reasonable?

-Eric