[Biopython-dev] [GSOC] Report - Week 1
Thomas Hamelryck
thomas.hamelryck at gmail.com
Tue Jun 8 06:39:53 UTC 2010
Hi all,
I think it's great that Bio.PDB is being updated.
Here are some remarks:
I haven't seen much discussion about the one key feature of Bio.PDB
that definitely needs to be improved: its speed. With the enormous
increase in the number of structures, extracting data using Bio.PDB is
too slow. It would be good to move some parts to C.
A second issue is nicely illustrated by the following code snippet:
> s = p.get_structure('a', '2KSX.pdb')
> [i for i in s.search_ss_bonds()]
I think this is NOT the way to do it. PDB files can contain anything:
RNA, DNA, sugars, small molecules... It is thus not a good idea to
directly attach protein-specific methods to the Structure class; it
will lead to a bloated Structure class and a lot of irrelevant methods
(e.g. search_ss_bonds is meaningless for a PDB file that contains RNA).
Currently, one creates Polypeptide objects from a Structure object
using a factory design pattern (via PPBuilder); the Polypeptide class
implements some protein-specific methods. I believe that is a much
cleaner way to do it (though we need a Protein class that represents
collections of connected polypeptides). One can also make sure that
all such derived objects (Protein, NA, DNA,...) adhere to the same
interface by providing a suitable base class with shared functionality
- in that way, the whole thing is also extensible.
Something like:
s = p.get_structure('a', '2KSX.pdb')
pb = ProteinBuilder()
proteins = pb.build(s)
ssbridges = proteins.get_ss_bonds()
Here, "proteins" would represent a collection of polypeptide chains.
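To make the idea a bit more concrete, here is a minimal, self-contained
sketch of the builder-plus-base-class layout. Everything in it is
hypothetical and only for illustration: Residue is a toy stand-in (not
the Bio.PDB class), Molecule, Protein, NucleicAcid and the two builders
are assumed names, the residue-name sets are deliberately truncated, and
the 2.5 Angstrom SG-SG cutoff is just a plausible assumption. None of
this is existing Bio.PDB API.

```python
from dataclasses import dataclass
from itertools import combinations
from math import dist
from typing import Optional, Tuple


@dataclass(frozen=True)
class Residue:
    """Toy stand-in for a residue (hypothetical, not Bio.PDB.Residue)."""
    chain_id: str
    resname: str
    seq_id: int
    sg: Optional[Tuple[float, float, float]] = None  # cysteine SG coords


class Molecule:
    """Shared base class: functionality common to all derived objects."""

    def __init__(self, residues):
        self.residues = list(residues)

    def __len__(self):
        return len(self.residues)

    def chain_ids(self):
        return sorted({r.chain_id for r in self.residues})


class Protein(Molecule):
    """Collection of polypeptide chains; protein-specific methods go here."""

    def get_ss_bonds(self, cutoff=2.5):
        # Pair up cysteines whose SG atoms lie within `cutoff` Angstrom.
        # The cutoff value is an assumption made for this sketch.
        cys = [r for r in self.residues
               if r.resname == "CYS" and r.sg is not None]
        return [(a, b) for a, b in combinations(cys, 2)
                if dist(a.sg, b.sg) <= cutoff]


class NucleicAcid(Molecule):
    """Nucleic acids share the base interface but have no get_ss_bonds."""


# Truncated residue-name sets, enough for the toy example only.
AMINO_ACIDS = {"ALA", "CYS", "GLY", "SER"}
NUCLEOTIDES = {"A", "C", "G", "U", "DA", "DC", "DG", "DT"}


class ProteinBuilder:
    def build(self, residues):
        return Protein(r for r in residues if r.resname in AMINO_ACIDS)


class NucleicAcidBuilder:
    def build(self, residues):
        return NucleicAcid(r for r in residues if r.resname in NUCLEOTIDES)


# A toy "structure": two protein chains and one RNA residue.
structure = [
    Residue("A", "CYS", 3, (0.0, 0.0, 0.0)),
    Residue("A", "GLY", 4),
    Residue("B", "CYS", 7, (2.0, 0.0, 0.0)),
    Residue("B", "CYS", 9, (10.0, 0.0, 0.0)),
    Residue("C", "G", 1),
]

proteins = ProteinBuilder().build(structure)
ssbridges = proteins.get_ss_bonds()
nucleic = NucleicAcidBuilder().build(structure)
```

The structure itself stays free of protein-specific methods; a file with
only RNA simply yields an empty Protein object, and the NucleicAcid
object never exposes get_ss_bonds at all.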
Cheers,
-Thomas
--
Thomas Hamelryck, Assoc. Prof.
Group leader Structural Bioinformatics
Bioinformatics center
Department of Biology
University of Copenhagen
Ole Maaloes Vej 5
DK-2200 Copenhagen N
Denmark
http://www.binf.ku.dk/research/structural_bioinformatics/