[Biopython-dev] Future of Bio.PDB

Wed Feb 19 16:35:56 UTC 2014

I frequently make use of Bio.PDB, and agree wholeheartedly that certain
aspects of it are very dated, or haphazardly organized.

The module as a whole would benefit greatly from some extra attention. I'm
happy to lend a hand in whatever revamp takes place.

David Cain
+1 (339) 222 4452

On Wed, Feb 19, 2014 at 11:22 AM, Eric Talevich <eric.talevich at gmail.com> wrote:

> On Wed, Feb 19, 2014 at 6:54 AM, João Rodrigues <anaryin at gmail.com> wrote:
>
> > From another thread:
> >
> > > As for what I suggested: since my GSoC period, already 4 years ago, I
> > > noticed that the PDB module is a bit messy in terms of organization.
> > > The module itself is named after the databank, which can be confused
> > > with the format name, the mmCIF parser is defined inside a subfolder,
> > > and there are application wrappers there too (DSSP, NACCESS). Besides
> > > this issue, which is not an issue at all and just my own pet peeve,
> > > there is a lot that the entire module could gain from a thorough
> > > revision. I've been using it very often and some normal manipulations
> > > of structures are not straightforward to carry out (calculating a
> > > center of mass for example, removing double occupancies) due to the
> > > parser being slow and quite memory hungry. In fact, trying to run the
> > > parser on a very large collection of structures often results in a
> > > random crash due to memory issues.
> > > I've been toying with a lot of changes, performance improvements, etc.,
> > > but I'm not satisfied at all with them. Some things that I've been
> > > trying are to have the structure coordinates defined as a full numpy
> > > array instead of N arrays per structure (one per atom), or the usage
> > > of __slots__ to mitigate memory usage (managed to get it down 33% this
> > > way). This would also go in line with a suggestion from Eric a long
> > > time ago to make a Bio.Struct module which would be the perfect
> > > "playground" to implement and test these changes. Other developments
> > > that I think are worth looking into are for example making a nice
> > > library to link a parsed structure to the PDB database and fetch
> > > information on it using the REST services they provide.
> > > I'd like to hear your opinion (as in, everybody, developers and users)
> > > on this and if it makes sense to indeed give a bit of TLC to the
> > > Bio.PDB module. Also, on what changes you think should be carried out
> > > to improve the module, like which features are missing, which
> > > applications are worth wrapping.
> > > Just to kick off some discussion. Maybe a new thread should be opened
> > > for this later on.
> > > Cheers,
> > > João
> >
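
To make the __slots__ idea and the center-of-mass use case concrete, here
is a minimal sketch in plain Python. PlainAtom, SlottedAtom and
center_of_mass are hypothetical names made up for illustration, not
Bio.PDB classes or functions, and the masses are approximate. It only shows
that a slotted class drops the per-instance __dict__; it does not measure
memory savings.

```python
class PlainAtom:
    """Ordinary class: every instance carries its own __dict__."""

    def __init__(self, name, coord, mass):
        self.name = name
        self.coord = coord
        self.mass = mass


class SlottedAtom:
    """Same fields, but __slots__ removes the per-instance __dict__."""

    __slots__ = ("name", "coord", "mass")

    def __init__(self, name, coord, mass):
        self.name = name
        self.coord = coord
        self.mass = mass


def center_of_mass(atoms):
    """Return the mass-weighted mean position as an (x, y, z) tuple."""
    total = sum(atom.mass for atom in atoms)
    return tuple(
        sum(atom.mass * atom.coord[axis] for atom in atoms) / total
        for axis in range(3)
    )


# Hypothetical two-atom example with approximate atomic masses.
atoms = [
    SlottedAtom("C", (0.0, 0.0, 0.0), 12.011),
    SlottedAtom("O", (2.0, 0.0, 0.0), 15.999),
]
com = center_of_mass(atoms)

plain_has_dict = hasattr(PlainAtom("C", (0.0, 0.0, 0.0), 12.011), "__dict__")
slotted_has_dict = hasattr(atoms[0], "__dict__")
```

The center of mass lands closer to the heavier atom, and only the plain
class reports a __dict__.
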
> >
> > As for the name of the module, yes, Bio.Struct is just the "legacy"
> > name I remember. Bio.structure would probably be better and clearer.
> >
>
> The p3d folks once offered to incorporate their work into Biopython:
> http://www.biomedcentral.com/1471-2105/10/258
>
> We had concerns about having p3d and Bio.PDB coexisting within Biopython,
> but if someone wanted to emulate the Bio.PDB API on top of p3d, or
> otherwise slip p3d's secret sauce into the Bio.PDB internals, that would do
> the trick. (I have not thought about the details of how this would work at
> all.) I think it should also be possible to replace p3d's custom query
> language with the sort of tricks Bio.Phylo, pandas and SQLAlchemy do with
> keyword arguments and generators to get the same results with Python
> syntax.
>
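
As a rough illustration of the keyword-arguments-plus-generators approach
mentioned above, here is a minimal sketch. The Atom tuple and find_atoms
function are hypothetical names for this example only; this is not p3d's
query language and not an existing Bio.PDB API.

```python
from collections import namedtuple

# Hypothetical flat atom record, just for this sketch.
Atom = namedtuple("Atom", "name resname chain resseq")


def find_atoms(atoms, **criteria):
    """Lazily yield atoms whose attributes match every keyword criterion.

    A criterion value is either a plain value (tested with ==) or a
    callable predicate applied to the attribute.
    """
    for atom in atoms:
        for attr, wanted in criteria.items():
            value = getattr(atom, attr)
            matched = wanted(value) if callable(wanted) else value == wanted
            if not matched:
                break
        else:
            # No criterion failed, so this atom satisfies the query.
            yield atom


atoms = [
    Atom("CA", "ALA", "A", 1),
    Atom("CB", "ALA", "A", 1),
    Atom("CA", "GLY", "A", 2),
    Atom("CA", "ALA", "B", 1),
]

# Alpha carbons in chain A, selected with plain Python syntax.
ca_in_chain_a = list(find_atoms(atoms, name="CA", chain="A"))

# A predicate instead of a fixed value.
late_residues = list(find_atoms(atoms, resseq=lambda n: n >= 2))
```

Because find_atoms is a generator, nothing is scanned until the caller
iterates, and queries can be chained by feeding one result into another.
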
> Alternatively, there is the option of sticking with the Bio.PDB namespace
> and adding only "read", "write" and "convert" functions to
> Bio/PDB/__init__.py to make the basic usage of the module more similar to
> the other Biopython sub-packages. The Model class could store one or
> several NumPy arrays that cover all atom coordinates, and the Chain,
> Residue, Atom and Interface classes would probably just store references to
> that array, e.g. a shorter 1D array of integer row indexes.
>
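
A minimal sketch of the shared-array layout described above, assuming
NumPy is available. The Model and Residue classes here are hypothetical
stand-ins written for this example, not the existing Bio.PDB classes; the
only point is that the model owns one coordinate array and the lower-level
objects keep integer row indexes into it.

```python
import numpy as np


class Model:
    """Owns a single (n_atoms, 3) coordinate array for the whole model."""

    def __init__(self, coords):
        self.coords = np.asarray(coords, dtype=float)


class Residue:
    """Stores only integer row indexes into the parent model's array."""

    def __init__(self, model, name, atom_rows):
        self.model = model
        self.name = name
        self.atom_rows = np.asarray(atom_rows, dtype=np.intp)

    @property
    def coords(self):
        # Integer-array indexing returns a copy of the selected rows.
        return self.model.coords[self.atom_rows]

    def translate(self, shift):
        # Writes go through the shared array, so the model sees them.
        self.model.coords[self.atom_rows] += np.asarray(shift, dtype=float)


model = Model([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
res_a = Residue(model, "ALA", [0, 1])
res_b = Residue(model, "GLY", [2, 3])

# Moving one residue updates the shared array in place.
res_b.translate([0, 0, 10])

# Whole-model operations become single vectorized calls.
centroid = model.coords.mean(axis=0)
```

One design consequence: reading res.coords gives a copy, so in-place
edits have to go through a method such as translate rather than through
the returned array.
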
> Would either of these internal changes make it easier to apply the GSoC
> work that's been done on Bio.PDB?
>
> -Eric