[Biopython] Bio.PDB local MMCIF files

João Rodrigues anaryin at gmail.com
Wed Feb 19 14:39:10 UTC 2014


Hello,

The implementation by the EBI people that I was referring to is here:
http://www.ebi.ac.uk/~glen/PDBeCIF/
I tested it during a workshop and it is very fast and robust (they use it
themselves, which should be reason enough), so maybe we could benefit a lot
from either incorporating or adapting it?

As for what I suggested: since my GSoC period, already 4 years ago, I have
noticed that the PDB module is a bit messy in terms of organization. The
module itself is named after the databank, which can be confused with the
format name; the mmCIF parser is defined inside a subfolder; and there
are application wrappers in there too (DSSP, NACCESS). Besides this issue,
which is not really an issue and just my own pet peeve, there is a lot that
the entire module could gain from a thorough revision. I've been using it
very often, and some normal manipulations of structures are not
straightforward to carry out (calculating a center of mass, for example, or
removing double occupancies) due to the parser being slow and quite memory
hungry. In fact, trying to run the parser on a very large collection of
structures often results in a random crash due to memory issues.
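
To illustrate the center-of-mass case, here is a minimal sketch. It is not
part of Bio.PDB: the SimpleAtom record and the center_of_mass function are
hypothetical names, and the only assumption is that each atom exposes a
"mass" attribute and a three-component "coord" attribute.

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed atom; only `mass` and `coord` are used.
SimpleAtom = namedtuple("SimpleAtom", ["mass", "coord"])


def center_of_mass(atoms):
    """Return the mass-weighted mean position of `atoms` as an (x, y, z) tuple."""
    total_mass = 0.0
    weighted = [0.0, 0.0, 0.0]
    for atom in atoms:
        mass = float(atom.mass)
        total_mass += mass
        # Accumulate mass * coordinate separately for x, y and z.
        for axis, value in enumerate(atom.coord):
            weighted[axis] += mass * float(value)
    if total_mass == 0.0:
        raise ValueError("no atoms given, or total mass is zero")
    return tuple(component / total_mass for component in weighted)


atoms = [
    SimpleAtom(mass=1.0, coord=(0.0, 0.0, 0.0)),
    SimpleAtom(mass=3.0, coord=(4.0, 0.0, 0.0)),
]
print(center_of_mass(atoms))  # (3.0, 0.0, 0.0)
```

Because the function only relies on those two attributes, it should in
principle also accept the atoms of a parsed structure, assuming the atom
objects provide both "coord" and "mass".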

I've been toying with a lot of changes, performance improvements, etc., but
I'm not satisfied with them at all. Some things I've been trying are
having the structure coordinates defined as a single numpy array instead of N
arrays per structure (one per atom), and using __slots__ to reduce
memory usage (I managed to get it down by 33% this way). This would also be in
line with a suggestion Eric made a long time ago to create a Bio.Struct
module, which would be the perfect "playground" to implement and test these
changes. Another development that I think is worth looking into is, for
example, a nice library to link a parsed structure to the PDB
database and fetch information on it using the REST services they provide.
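
The two memory ideas above (one shared coordinate store and __slots__) can
be sketched roughly as follows. This is only an illustration, not the
changes actually tried: SlimAtom and SlimStructure are hypothetical
classes, and a flat array from the standard library stands in for the
single numpy array so that the example has no dependencies.

```python
from array import array


class SlimAtom:
    """Hypothetical atom record; __slots__ avoids a per-instance __dict__."""

    __slots__ = ("name", "element", "index")

    def __init__(self, name, element, index):
        self.name = name
        self.element = element
        self.index = index  # position of this atom in the shared store


class SlimStructure:
    """Hypothetical container holding all coordinates in one flat array."""

    __slots__ = ("atoms", "coords")

    def __init__(self):
        self.atoms = []
        # Layout: x0, y0, z0, x1, y1, z1, ... (stand-in for an (N, 3) array).
        self.coords = array("d")

    def add_atom(self, name, element, xyz):
        x, y, z = xyz
        atom = SlimAtom(name, element, len(self.atoms))
        self.atoms.append(atom)
        self.coords.extend((x, y, z))
        return atom

    def coord_of(self, atom):
        start = 3 * atom.index
        return tuple(self.coords[start:start + 3])

    def translate(self, dx, dy, dz):
        # One pass over the shared store moves every atom at once.
        shift = (dx, dy, dz)
        for i in range(len(self.coords)):
            self.coords[i] += shift[i % 3]


structure = SlimStructure()
ca = structure.add_atom("CA", "C", (1.0, 2.0, 3.0))
structure.translate(1.0, 0.0, -1.0)
print(structure.coord_of(ca))  # (2.0, 2.0, 2.0)
```

The trade-off is that slotted objects cannot take arbitrary new attributes
at runtime, so any code that annotates atoms on the fly would need another
place to store that data.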

I'd like to hear your opinions (everybody's, developers and users alike) on
this, and on whether it makes sense to give a bit of TLC to the Bio.PDB
module. Also, on what changes you think should be carried out to improve
the module: which features are missing, and which applications are worth
wrapping.

Just to kick off some discussion. Maybe a new thread should be opened for
this later on.

Cheers,

João



