[BioPython] PDB parser

Thomas Hamelryck thamelry@vub.ac.be
Sat, 4 May 2002 23:53:11 -0700


Hi Catherine,

I was wondering what you want to use the PDB/Structure object for, e.g.
looking at structural information, extracting information from the header,
etc. I think it is difficult to make one Structure object that does
everything that people expect. I wrote down some considerations below.
Everybody should feel free to criticize and comment, of course.

In the most frequent case, you want convenient access to the data via the
Structure/Model/Chain/Residue/Atom (SMCRA, for short) hierarchy. Let's say
you want to use a single superfamily representative from SCOP
(Structural Classification of Proteins). In that case, you will need to
extract a number of domains from a set of structures. Each domain is
specified in the SCOP definition by its structure, chain(s) and residues. So
it would be convenient to have a class (let's say the Structure class) that
allows a flexible use of the SMCRA hierarchy, i.e., to do slicing, traversal
etc. Basically, this representation would do the bookkeeping. This class
could also contain the information in the parsed header (cell, spacegroup,
etc.).
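
To make the bookkeeping idea concrete, here is a rough sketch in Python. The
class names follow the SMCRA levels, but every method name, the structure id
and the example residues are made up for illustration; this is not existing
Biopython code, just one way such a hierarchy could look.

```python
class Entity:
    """Generic node in the SMCRA hierarchy (hypothetical base class)."""

    def __init__(self, ident):
        self.id = ident
        self.parent = None
        self._children = {}  # child id -> child, kept in insertion order

    def add(self, child):
        """Attach a child, set its parent link and return the child."""
        child.parent = self
        self._children[child.id] = child
        return child

    def __getitem__(self, ident):
        return self._children[ident]

    def __iter__(self):
        return iter(self._children.values())

    def __len__(self):
        return len(self._children)


class Structure(Entity):
    def __init__(self, ident, header=None):
        super().__init__(ident)
        self.header = header or {}  # parsed header: cell, spacegroup, ...


class Model(Entity):
    pass


class Chain(Entity):
    def residues_between(self, start, end):
        """Slice: residues whose sequence number lies in [start, end]."""
        return [res for res in self if start <= res.id <= end]


class Residue(Entity):
    def __init__(self, number, name):
        super().__init__(number)
        self.name = name


class Atom:
    def __init__(self, name, coord):
        self.id = name
        self.coord = coord
        self.parent = None


# Made-up example data: one model, one chain, four residues with a CA atom.
structure = Structure("1abc", header={"spacegroup": "P 21 21 21"})
chain = structure.add(Model(0)).add(Chain("A"))
for number, name in [(1, "MET"), (2, "CYS"), (3, "GLY"), (4, "CYS")]:
    residue = chain.add(Residue(number, name))
    residue.add(Atom("CA", (float(number), 0.0, 0.0)))

# Slicing: pick a "domain" given as chain A, residues 2 to 3.
domain = structure[0]["A"].residues_between(2, 3)

# Traversal: walk the whole hierarchy down to the atoms.
all_atoms = [atom
             for model in structure
             for ch in model
             for res in ch
             for atom in res]
```

A domain definition of the kind SCOP gives (structure, chain, residue range)
then maps directly onto a lookup like structure[0]["A"].residues_between(2, 3).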

A second class (let's say a Connectivity class) could contain a simple graph
of atoms with all the bonds between those atoms (both inter-residue and
intra-residue bonds). This would be convenient for people who want to use
the Python package to prepare the input for some kind of refinement or
visualization program. Maybe this representation could also do nearest
neighbor lookup, angle calculations, rotations & translations etc. Note that
this implies figuring out which atoms are bonded to which, which is not
specified in the PDB file itself. Implementing the first approach (the
bookkeeping Structure class) is trivial, while implementing this second
approach is much harder.
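
As a minimal sketch of the connectivity idea, assuming bonds are guessed from
interatomic distances alone: the function names, the single 1.9 Angstrom
cutoff and the coordinates below are all illustrative assumptions. A real
implementation would more likely use per-element covalent radii and a
spatial index instead of testing every pair of atoms.

```python
import math
from itertools import combinations


def build_connectivity(atoms, cutoff=1.9):
    """Guess bonds from distances.

    atoms  -- dict mapping atom name to an (x, y, z) tuple in Angstrom
    cutoff -- assumed maximum bond length; a crude single value
    Returns a dict mapping each atom name to the set of bonded atom names.
    """
    graph = {name: set() for name in atoms}
    for (a, xyz_a), (b, xyz_b) in combinations(atoms.items(), 2):
        if math.dist(xyz_a, xyz_b) <= cutoff:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def angle(a, b, c):
    """Angle a-b-c in degrees, with b at the vertex."""
    v1 = [p - q for p, q in zip(a, b)]
    v2 = [p - q for p, q in zip(c, b)]
    dot = sum(p * q for p, q in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


# Made-up, roughly backbone-like coordinates for a single residue.
backbone = {
    "N": (0.0, 0.0, 0.0),
    "CA": (1.46, 0.0, 0.0),
    "C": (2.0, 1.4, 0.0),
    "O": (3.2, 1.6, 0.0),
}

bonds = build_connectivity(backbone)
n_ca_c = angle(backbone["N"], backbone["CA"], backbone["C"])
```

The all-pairs loop is quadratic in the number of atoms, which is part of why
this representation is the harder one to do well for whole structures.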

Of course, in many cases you would like to take a look at what is in the PDB
file, e.g., you might want to examine all disulphide bridges. In that case,
you would want to work with a number of class instances (let's say of the
Polymer class) that represent various structural entities (polypeptides,
disulphide bridges, alpha-helices etc.). For this, you need the connectivity
information, of course. This representation would be the structural
interpretation of the PDB file.

It is clear that a structure object should support more than one of these
approaches. One way to combine the requirements is to attach the Polymer
objects as observers to the Structure objects. In this approach, you could
extract, e.g., all Polymer objects from a Chain object. Each of these Polymer
objects would contain at least one residue from that chain. In this way, you
could ask questions like "give me all disulphide bridges that involve
chain A in model 1", which combines the bookkeeping with the structural
interpretation demands. The Structure class would also produce the raw
connectivity information in a convenient data structure on demand. I think
this can all be done quite efficiently using Numpy, kjbuckets etc.
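
The observer idea could be sketched roughly as follows. This is only an
illustration: the attach() and polymers() methods, the DisulphideBridge and
Helix classes and the residue numbers are hypothetical, and the bridges are
entered by hand here rather than derived from connectivity.

```python
class Chain:
    """Bookkeeping side: a chain that knows which entities touch it."""

    def __init__(self, model_id, chain_id):
        self.model_id = model_id
        self.id = chain_id
        self.observers = []  # structural entities with a residue in this chain

    def attach(self, observer):
        # An intra-chain bridge touches the same chain twice; store it once.
        if observer not in self.observers:
            self.observers.append(observer)

    def polymers(self, kind=None):
        """Return attached entities, optionally filtered by class."""
        return [obs for obs in self.observers
                if kind is None or isinstance(obs, kind)]


class Polymer:
    """Interpretation side: an entity spanning residues of one or more chains."""

    def __init__(self, residues):
        # residues: sequence of (chain, residue_number) pairs
        self.residues = tuple(residues)
        for chain, _number in self.residues:
            chain.attach(self)


class DisulphideBridge(Polymer):
    def __init__(self, cys1, cys2):
        super().__init__([cys1, cys2])


class Helix(Polymer):
    pass


# Made-up example: two chains in model 1.
chain_a = Chain(1, "A")
chain_b = Chain(1, "B")

intra = DisulphideBridge((chain_a, 22), (chain_a, 87))   # within chain A
inter = DisulphideBridge((chain_a, 140), (chain_b, 12))  # links A and B
helix = Helix([(chain_b, n) for n in range(30, 42)])

# "Give me all disulphide bridges that involve chain A in model 1."
bridges_a = chain_a.polymers(DisulphideBridge)
```

Here bridges_a contains both the intra-chain bridge and the one linking
chains A and B, while chain_b.polymers() also returns the helix.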

I'm still working on a Structure object, which mainly means reworking older
code that I'm not happy with. I have a lot of code lying around that I would
like to get into shape for general use. I hope to make a Structure class
that does the bookkeeping available next week.

Friendly regards,

---
Thomas Hamelryck      Vrije Universiteit Brussel (VUB)
Institute for Molecular Biology            ULTR Department
Paardenstraat 65    1640 Sint-Genesius-Rode, Belgium
                 http://ultr.vub.ac.be/~thomas