[Biopython-dev] [GSOC] Report - Week 1

Tue Jun 8 06:39:53 UTC 2010

Hi all,

I think it's great that Bio.PDB is being updated.

Here are some remarks:

I haven't seen much discussion about the one key feature of Bio.PDB
that definitely needs to be improved: its speed. With the enormous
increase in the number of structures, extracting data using Bio.PDB is
too slow. It would be good to move some parts to C.

A second issue is nicely illustrated by the following code snippet:

> s = p.get_structure('a', '2KSX.pdb')
> [i for i in s.search_ss_bonds()]

I think this is NOT the way to do it. PDB files can contain anything:
RNA, DNA, sugars, small molecules... It is thus not a good idea to
directly attach protein-specific methods to the Structure class; it
will lead to a bloated Structure class and a lot of irrelevant methods
(e.g. search_ss_bonds is meaningless for a PDB file that contains RNA).

Currently, one creates Polypeptide objects from a Structure object
using a factory design pattern (via PPBuilder); the Polypeptide class
implements some protein-specific methods. I believe that is a much
cleaner way to do it (though we need a Protein class that represents
collections of connected polypeptides). One can also make sure that
all such derived objects (Protein, NA, DNA,...) adhere to the same
interface by providing a suitable base class with shared functionality
- in that way, the whole thing is also extensible.

Something like:

from Bio.PDB import PDBParser

p = PDBParser()
s = p.get_structure('a', '2KSX.pdb')
pb = ProteinBuilder()  # proposed class, does not exist in Bio.PDB yet
proteins = pb.build(s)
ssbridges = proteins.get_ss_bonds()

Here, "proteins" would represent a collection of polypeptide chains.
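
To make the idea a bit more concrete, here is a rough, self-contained
sketch of how the base class, the Protein class and the builder could
fit together. Everything in it is an assumption made for illustration:
MoleculeCollection, Protein, ProteinBuilder, the 3.0 Angstrom cutoff
and the toy "structure" (a plain list of chains instead of a real
Bio.PDB Structure) are all hypothetical and not part of Bio.PDB.

```python
# Hypothetical sketch: none of these classes exist in Bio.PDB.
from itertools import combinations
from math import dist


class MoleculeCollection:
    """Assumed shared base class for Protein, NA, ... objects."""

    def __init__(self, chains):
        self.chains = list(chains)

    def __len__(self):
        return len(self.chains)

    def __iter__(self):
        return iter(self.chains)


class Protein(MoleculeCollection):
    """Collection of polypeptide chains with protein-specific methods."""

    def get_ss_bonds(self, cutoff=3.0):
        # Collect (chain_id, resseq, SG coordinate) for every cysteine.
        sulfurs = [
            (chain_id, resseq, sg)
            for chain_id, residues in self.chains
            for resseq, resname, sg in residues
            if resname == "CYS" and sg is not None
        ]
        # A pair counts as a bridge when the SG-SG distance is within cutoff.
        return [
            (a[:2], b[:2])
            for a, b in combinations(sulfurs, 2)
            if dist(a[2], b[2]) <= cutoff
        ]


class ProteinBuilder:
    """Factory that keeps only the chains containing amino acids."""

    AMINO_ACIDS = {"ALA", "CYS", "GLY", "SER"}  # truncated for brevity

    def build(self, structure):
        chains = [
            (chain_id, residues)
            for chain_id, residues in structure
            if any(resname in self.AMINO_ACIDS for _, resname, _ in residues)
        ]
        return Protein(chains)


# Toy stand-in for a parsed structure: (chain_id, [(resseq, resname, SG)]).
toy_structure = [
    ("A", [(1, "CYS", (0.0, 0.0, 0.0)),
           (2, "GLY", None),
           (3, "CYS", (2.0, 0.0, 0.0))]),
    ("B", [(1, "G", None), (2, "C", None)]),  # an RNA chain, skipped
]

proteins = ProteinBuilder().build(toy_structure)
print(proteins.get_ss_bonds())  # [(('A', 1), ('A', 3))]
```

The point of the design is that Structure stays generic: the RNA chain
never gets a get_ss_bonds() method, and an NA builder could be added
later by subclassing the same base class.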

Cheers,

-Thomas

-- 
Thomas Hamelryck, Assoc. Prof.
Group leader Structural Bioinformatics
Bioinformatics center
Department of Biology
University of Copenhagen
Ole Maaloes Vej 5
DK-2200 Copenhagen N
Denmark
http://www.binf.ku.dk/research/structural_bioinformatics/