[Biopython-dev] Optimization of PDBParser and friends

Wed Sep 5 23:31:42 UTC 2012

On Wed, Sep 5, 2012 at 9:24 PM, João Rodrigues <anaryin at gmail.com> wrote:
> Hello all,
>
> Some news.
>
> A. The OrderedDict implementation is quite slow. It essentially slows down
> the parser by 30%, rendering all the improvements I had done moot.
> Therefore, although it's a great idea, a major reason for these updates is
> speed so I think it might not be worth it.

Which Python was that? i.e. was it the OrderedDict from the standard
library (which I hope is optimised), or the backport (which might be
slower)?
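
A quick way to check on whichever Python you are using would be something
like the rough sketch below. The helper names (fill, time_fill) and the
key count are made up for illustration, and the numbers will vary with the
interpreter and machine:

```python
import timeit
from collections import OrderedDict


def fill(mapping_type, n=10000):
    """Insert n integer keys into a fresh mapping and return it."""
    mapping = mapping_type()
    for i in range(n):
        mapping[i] = None
    return mapping


def time_fill(mapping_type, repeat=3, number=10):
    """Best-of-`repeat` time (seconds) for `number` calls to fill()."""
    timer = timeit.Timer(lambda: fill(mapping_type))
    return min(timer.repeat(repeat=repeat, number=number))


# Hypothetical usage: compare the plain dict against OrderedDict.
dict_time = time_fill(dict)
ordered_time = time_fill(OrderedDict)
print("dict: %.4fs  OrderedDict: %.4fs" % (dict_time, ordered_time))
```

This only measures insertion, not the parser as a whole, so treat it as a
sanity check rather than a benchmark.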

> B. As an alternative to this, I implemented the following. Entity now has
> only child_dict, which is a plain dictionary. However, each object (Model,
> Chain, Residue, Atom) gets its own __cmp__ method, built from the
> information in the "_sort" methods that already existed. In this way,
> simply sorting the values of the dictionary returns an ordered list. I
> tweaked Atom.__cmp__ to sort the N, CA, C, O atoms first and the rest
> alphabetically. I also made inorganic atoms such as calcium come at the
> end, which should make things a bit nicer when calcium is involved, for
> example. Finally, the only downside seems to be that we lose the order in
> which residues are inserted, i.e. if residue 151 is the first in the PDB
> file and all others range from 1-150, then this residue 151 is going to be
> placed at the end when you iterate. However, from my experience and in my
> opinion, not only is this logical, but it also rarely happens in real PDB
> files.

That seems risky - but see if you can sort out what is happening
with the unit tests (below).

I'm not sure about your atomic sorting... it seems a bit magic. Would
sorting on atomic number be nicer (and simpler)?
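
To make the comparison concrete, here is a rough sketch of the two
orderings as sort keys. The Atom tuple, the lookup tables and the key
function names are all hypothetical stand-ins, not the Bio.PDB classes,
and the "inorganic atoms last" rule is left out:

```python
from collections import namedtuple

# Hypothetical stand-in for an atom: PDB atom name plus element symbol.
Atom = namedtuple("Atom", ["name", "element"])

# Backbone atom names get fixed ranks; everything else sorts after them.
BACKBONE_ORDER = {"N": 0, "CA": 1, "C": 2, "O": 3}

# Tiny illustrative table; a real one would cover the periodic table.
ATOMIC_NUMBER = {"H": 1, "C": 6, "N": 7, "O": 8, "S": 16, "CA": 20}


def backbone_first_key(atom):
    """N, CA, C, O first, then the remaining atoms alphabetically."""
    rank = BACKBONE_ORDER.get(atom.name, len(BACKBONE_ORDER))
    return (rank, atom.name)


def atomic_number_key(atom):
    """Sort by atomic number, breaking ties on the atom name."""
    return (ATOMIC_NUMBER[atom.element], atom.name)


atoms = [
    Atom("CB", "C"), Atom("O", "O"), Atom("CA", "C"),
    Atom("N", "N"), Atom("C", "C"), Atom("OG", "O"),
]

by_backbone = [a.name for a in sorted(atoms, key=backbone_first_key)]
by_number = [a.name for a in sorted(atoms, key=atomic_number_key)]
print(by_backbone)  # ['N', 'CA', 'C', 'O', 'CB', 'OG']
print(by_number)    # ['C', 'CA', 'CB', 'N', 'O', 'OG']
```

Note that the atomic number key needs the element, not just the atom name,
since the name "CA" is used for both the alpha carbon and calcium.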

> C. I am strongly in favour of removing most (if not all) set/get methods and
> replace them by direct attribute access. For instance, "atom.get_parent()
> --> atom.parent". Saves some space in the code and makes things more
> transparent.

It would also look less like Java code ;)

I like this plan - but initially define and document the new attributes,
and deprecate the old get/set methods. Without that you'll break
almost every PDB-using script out there.
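
Roughly what I have in mind, as a minimal sketch: the class below is a
hypothetical stand-in rather than the real Bio.PDB Atom, and it uses the
built-in DeprecationWarning to stay self-contained (in Biopython itself
BiopythonDeprecationWarning would be the natural choice):

```python
import warnings


class Atom(object):
    """Hypothetical minimal stand-in for an atom with a parent residue."""

    def __init__(self, name, parent=None):
        self.name = name
        # New style: the parent is a plain, documented attribute.
        self.parent = parent

    def get_parent(self):
        """Deprecated accessor, kept so existing scripts keep working."""
        warnings.warn(
            "get_parent() is deprecated; use the .parent attribute instead",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller's line
        )
        return self.parent
```

Old scripts calling atom.get_parent() would still run but see a warning,
while new code can use atom.parent directly.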

> D. I edited the PDBParser to tweak a few things, nothing major. The file
> handle is now treated as an iterator throughout the parsing, so it should
> be more memory-friendly. The line counter is still preserved. I also added
> a test to make the get_header argument actually work.
>
> E. General things here and there that I just can't remember.
>
> F. Unittests are breaking everywhere. Checking why, but it all seems related
> to this sorting issue.
>
> Cheers,
>
> João
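
On point D, for anyone following along, treating the handle as an iterator
while keeping a line counter could look roughly like the sketch below. The
function name iter_records and the sample text are made up for
illustration; this is not the actual PDBParser code:

```python
import io


def iter_records(handle):
    """Lazily yield (line_number, record_type, line) from a PDB handle.

    The handle is consumed one line at a time, so the whole file never
    has to be held in memory. The record type is columns 1-6 of the line.
    """
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines but keep counting them
        record_type = line[:6].strip()
        yield line_number, record_type, line


# Hypothetical usage with an in-memory handle instead of a real file.
pdb_text = "HEADER    TEST\nATOM      1  N   ALA A   1\nEND\n"
records = list(iter_records(io.StringIO(pdb_text)))
for line_number, record_type, _ in records:
    print(line_number, record_type)
```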

Regards,

Peter