[Biopython-dev] Module reorganization for upcoming Bio.PDB enhancements

Mon May 31 15:53:31 UTC 2010

On Mon, May 31, 2010 at 4:38 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
> Hi all,
>
> This summer our GSoC student João Rodrigues will be implementing a number of
> enhancements to Biopython's structural biology modules. Since Bio.PDB is one
> of the most widely used parts of Biopython, I'd like to find a way to
> let João add major new features without breaking existing code and
> documentation.
>
> There are a few issues I'd like to address:
>
> 1. The I/O conventions of parse/read/write/convert seem to work very well in
> SeqIO, AlignIO, Phylo, and other Biopython sub-packages. Bio.PDB supports
> I/O in several formats, but the API is lower-level and isn't unified in the
> same way (yet).

Currently Bio.PDB supports the plain text PDB format, and has partial
support for mmCIF. It lacks support for PDBML (Protein Data Bank Markup
Language), the XML version of the PDB format.

Under this proposed scheme, what would you see as the basic record type
(analogous to a SeqRecord, alignment or tree in Bio.SeqIO, Bio.AlignIO and
Bio.Phylo)? It would be nice to say a protein chain, but there is the issue of
multiple models (e.g. from NMR). I presume you'd go with the model as the
basic unit (where each model may contain multiple chains).
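
To make the question concrete, here is a minimal sketch of what SeqIO-style
parse/read functions might look like with the model as the basic unit. It is
an illustration only: the function names, the format registry, the toy
splitter and the (model number, atoms-per-chain) tuples are all hypothetical
stand-ins, not existing Bio.PDB or Bio.Struct code. A real implementation
would presumably return Bio.PDB model objects instead.

```python
import io


def _iter_pdb_models(handle):
    """Toy splitter: yield (model_number, {chain_id: atom_count}) per model.

    Only MODEL, ENDMDL, ATOM and HETATM records are examined; this is not
    a real PDB parser.
    """
    number, chains = 1, {}
    for line in handle:
        record = line[:6].strip()
        if record == "MODEL":
            number = int(line[6:14])
            chains = {}
        elif record in ("ATOM", "HETATM"):
            chain_id = line[21:22] or " "  # chain ID is column 22
            chains[chain_id] = chains.get(chain_id, 0) + 1
        elif record == "ENDMDL":
            yield number, chains
            chains = {}
    if chains:
        # No MODEL/ENDMDL records (typical X-ray entry): one implicit model.
        yield number, chains


# Hypothetical registry mapping lowercase format names to parsers.
_PARSERS = {"pdb": _iter_pdb_models}


def parse(handle, format="pdb"):
    """Return an iterator over the models in the file."""
    try:
        iterator = _PARSERS[format]
    except KeyError:
        raise ValueError("Unknown format %r" % format) from None
    return iterator(handle)


def read(handle, format="pdb"):
    """Return the single model in the file, or raise ValueError."""
    models = list(parse(handle, format))
    if len(models) != 1:
        raise ValueError("Expected exactly one model, found %i" % len(models))
    return models[0]


# A two-model, NMR-style fragment (coordinates omitted for brevity).
example = (
    "MODEL        1\n"
    "ATOM      1  N   MET A   1\n"
    "ATOM      2  CA  MET A   1\n"
    "ATOM      3  N   GLY B   1\n"
    "ENDMDL\n"
    "MODEL        2\n"
    "ATOM      1  N   MET A   1\n"
    "ENDMDL\n"
)

for model_number, chains in parse(io.StringIO(example)):
    print(model_number, chains)
```

With this convention, read() on a multi-model NMR file would raise a
ValueError, much as Bio.SeqIO.read does for a file with more than one record,
so callers would have to use parse() for such files.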

> 2. PDB headers seem to have become better structured in recent years, in
> both the wwPDB spec and submitted files. But header info isn't well
> integrated with the PDB Structure object, and parse_pdb_header needs some
> attention as well.

Agreed.

> 3. Kristian asked on this list a while ago about the proper location for his
> new code that works with RNA structures. While RCSB's PDB contains some RNA
> structures, the RNA world doesn't revolve around it. Similarly, João needs a
> place to put code for structure prediction/validation servers, command-line
> wrappers, secondary structures, etc.
>
>
> I propose a new sub-package called Bio.Struct for these enhancements:
>
> from Bio import Struct
> mystruct = Struct.read("1MOT.pdb", "pdb")
> # Or, letting the format argument default to "pdb":
> mystruct = Struct.read("1MOT.pdb")
> # Eventually this will work too:
> Struct.convert("1MOT.pdb", "pdb", "1MOT.xml", "pdbxml")

I'd probably go with "pdbml" rather than "pdbxml" since that seems to be
what the PDB themselves call it:
http://www.pdb.org/pdb/static.do?p=file_formats/index.jsp

> from Bio.Struct.Applications import DSSP
> # Like the other command-line wrappers
> # (I'm curious about Peter's cunning new scheme...)

See:
http://lists.open-bio.org/pipermail/biopython-dev/2010-May/007773.html

> from Bio.Struct import WHATIF, Jpred
> # Servers each get their own module

Hmm - perhaps we may need to have another level here, Bio.Struct.Servers
or Bio.Struct.WWW or something. How many of these do you expect?

> from Bio.Struct import RNA
> # Would this work for you, Kristian?
>
>
> Alternatively, we could do all of this within the PDB module -- so picture
> the above examples with "PDB" in place of "Struct". This raises the chance
> of naming collisions, though, and doesn't solve issue #3 above.
>
>
> We'll leave the existing PDB module layout alone, in general. I think it
> will be necessary to add a few more attributes to the
> Bio.PDB.Structure.Structure class, but we can do this without breaking
> compatibility. Since fewer people depend on the exact formatting of the
> Structure.header data (we believe), it's safer to change this dictionary,
> moving the more essential entries to a separate attribute, or whatever seems
> reasonable when we dig into it.
>
> Comments?

I don't want us to break backwards compatibility in Bio.PDB (judging by
citations at least, it is widely used), but I would like us to continue
making small fixes and enhancements to it. Therefore a new Bio.Struct
module may be the safer option.

Peter