[Biopython-dev] Module reorganization for upcoming Bio.PDB enhancements

Wed Jun 9 09:31:18 EDT 2010

On Tue, Jun 8, 2010 at 1:10 PM, João Rodrigues <anaryin at gmail.com> wrote:

> Hello all,
>
> I'm replying here to what Thomas wrote on the GSOC Report thread because it
> seems a better place.
>
>> PDB files can contain anything: RNA, DNA, sugars, small molecules... It is
>> thus not a good idea to directly associate protein-specific methods with
>> the Structure class; it will lead to a bloated Structure class and a lot
>> of irrelevant methods (e.g. search_ss_bonds is meaningless for a PDB file
>> that contains RNA).
>
>
> Agree.
>
>> Currently, one creates Polypeptide objects from a Structure object using a
>> factory design pattern (via PPBuilder); the Polypeptide class implements
>> some protein-specific methods. I believe that is a much cleaner way to do it
>> (though we need a Protein class that represents collections of connected
>> polypeptides). One can also make sure that all such derived objects
>> (Protein, NA, DNA,...) adhere to the same interface by providing a suitable
>> base class with shared functionality - in that way, the whole thing is also
>> extendible.
>>
>
> I think there has already been some discussion about this. My personal
> opinion/suggestion is having a structure like:
>
> Bio.PDB/
> _______/Protein.py
> _______/DNA.py
> _______/RNA.py
>
> that would translate to usage like:
>
> from Bio.PDB import Protein
> structure = Protein('1ABC.pdb')
> structure.search_ss_bonds()
>
> but not
>
> structure.calc_melting_temperature() (just an example)
>

How about:

from Bio import Struct

# extract the protein from a bound TF structure
# (named tf_complex to avoid shadowing Python's built-in complex type)
tf_complex = Struct.read("3IKT.pdb")
prot = tf_complex.as_protein()

# which is a wrapper for:
from Bio.Struct.Protein import Protein
# if Protein contains a Structure instance:
prot = Protein(tf_complex)
# or, if Protein inherits from Structure:
prot = Protein.from_structure(tf_complex)

The Bio.Struct.Protein module would mostly wrap Bio.PDB's protein-specific
functionality, and contain a class called Protein which you construct using
a Bio.PDB.Structure.Structure instance, in some way.
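
To make the two construction styles concrete, here is a rough sketch using a
stand-in Structure class rather than the real Bio.PDB one. The names
WrappedProtein, DerivedProtein and chain_ids are made up for illustration, and
the chain data is invented; nothing here exists in Biopython.

```python
class Structure:
    """Stand-in for Bio.PDB.Structure.Structure (not the real class)."""

    def __init__(self, id, chains=None):
        self.id = id
        # Copy the mapping so instances never share one dict.
        self.chains = dict(chains or {})


class WrappedProtein:
    """Option 1: Protein contains a Structure instance."""

    def __init__(self, structure):
        self.structure = structure

    def chain_ids(self):
        # Delegate to the wrapped object.
        return sorted(self.structure.chains)


class DerivedProtein(Structure):
    """Option 2: Protein inherits from Structure."""

    @classmethod
    def from_structure(cls, structure):
        # Build a new subclass instance from the parent's state.
        return cls(structure.id, structure.chains)

    def chain_ids(self):
        return sorted(self.chains)


# Invented example data, not taken from any real PDB entry.
parsed = Structure("demo", {"B": "protein", "A": "protein"})
wrapped = WrappedProtein(parsed)
derived = DerivedProtein.from_structure(parsed)
```

With containment, the Protein keeps a reference to the original Structure, so
both see the same data. With inheritance, from_structure has to copy state
into a new object, but the result can be passed anywhere a Structure is
expected.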

I think the convenience methods as_protein, as_dna and as_rna are acceptable
additions to the Structure class if that saves us from (a) polluting
Structure with protein- and RNA-specific methods, or (b) requiring a slew of
imports to reach any new functionality. You can add as_protein yourself and
leave the other methods for other brave souls to implement. (Bio.Struct.RNA
deserves its own directory, and I don't know of anyone working on a
structural DNA branch.)
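
As a sketch of how thin those convenience methods could be, assuming the
containment variant of Protein: the Structure and Protein classes below are
stand-ins, not the real Bio.PDB classes, and the residue table and the
protein_residue_names method are hypothetical placeholders.

```python
class Structure:
    """Stand-in for Bio.PDB.Structure.Structure (not the real class)."""

    def __init__(self, residue_names):
        self.residue_names = list(residue_names)

    def as_protein(self):
        # The only protein-aware code on Structure: hand off to Protein.
        return Protein(self)

    def as_dna(self):
        # Placeholder until someone implements the DNA side.
        raise NotImplementedError("as_dna is not implemented yet")


class Protein:
    """Hypothetical home for the protein-specific methods."""

    # Deliberately tiny residue table, for illustration only.
    AMINO_ACIDS = {"ALA", "CYS", "GLY", "SER"}

    def __init__(self, structure):
        self.structure = structure

    def protein_residue_names(self):
        # Keep only the residues that belong to the protein part.
        return [
            name
            for name in self.structure.residue_names
            if name in self.AMINO_ACIDS
        ]


# A mixed complex: amino acids, two DNA nucleotides and a water.
mixed = Structure(["CYS", "DA", "ALA", "DT", "HOH"])
prot = mixed.as_protein()
print(prot.protein_residue_names())
```

Structure gains one short method per molecule type, and everything
protein-specific stays on Protein.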

> Protein() would call PDBParser(). It could also include, to a certain
> extent, an Alphabet-like feature to ensure residue names are OK (this ties
> in with this proposal <http://www.biopython.org/wiki/GSOC2010_Joao#Residue_name_normalisation>).
> I believe this is close to what you said: a class that basically
> abstracts what we do now (Bio.PDB.PDBParser) and allows for
> molecule-specific methods. However, it also leads to some problems:
> Protein/DNA complexes come to mind.
>
> How does this sound? I think it goes with what Eric said in the first post
> of this thread and what Thomas replied in the GSOC thread. We should also
> change the PDB name to Struct to better reflect the purpose of the module.
> All of the other additions like Bio.Struct.WWW would still apply. And I
> don't see a major problem in breaking the existing code by adding this.
>

To be clear, we don't need to rename anything -- Bio.Struct and Bio.PDB can
live in harmony for the foreseeable future.

Best,
Eric