[Biopython-dev] Benchmarking PDBParser

Fri May 13 06:35:27 UTC 2011

Assigning the element on demand would be too slow, especially when working
with modelling structures or other element-less 'formats'. I'd replace your
option (b) with a function to assign elements that could be called once, at
will, from any entity subclass.

On the other hand, optimizing the process will probably help, but not by
much, I would say. Does anyone have ideas on this? Maybe a dictionary with
all possible atom fullnames?

A third issue here is the overhead that parsing the header brings. It
completely kills performance. There is a flag in the parser called
get_header that is useless at the moment. A first step would be to make it
usable. At least we would have an option to skip the slow part. Perhaps then
it would be nice to look at parse_pdb_header and see if we can optimize it.
I'm curious to see the performance of my branch there because I added more
parsing options there too.

Cheers,

João
On 13 May 2011 04:27, "Eric Talevich" <eric.talevich at gmail.com> wrote:
> On Thu, May 12, 2011 at 9:59 AM, João Rodrigues <anaryin at gmail.com> wrote:
>
>> First results: http://www.biopython.org/wiki/PDBParser
>>
>> Comments?
>>
>
> Cool. So the atom_element additions did slow the parser down noticeably.
> The warnings may have caused some tiny slowdown, presumably when handling PDB
> files with inconsistencies, but I personally am not concerned about that.
>
> I think atom element assignment could be sped up in either of two ways:
> (a) Try to optimize Atom._assign_element for speed, somehow
> (b) Store only the atom field as a string during parsing. Change
> Atom.element and Atom.mass to be properties that parse the atom field to
> determine the element type on demand (i.e. self._get_element checks if
> self._element exists yet; if not, parse the string and set self._element;
> self._get_mass is basically identical to _assign_atom_mass).
>
> The lazy loading approach (b) would be faster if you're not using the
> element/mass values at all, but probably a little slower if you need those
> values from every atom in a structure.
>
> -E
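
For reference, the lazy loading approach (b) described above could look
roughly like the sketch below. This is an illustration, not Bio.PDB code:
the class LazyAtom, the parse_count counter, the ATOMIC_MASSES table and
the naive first-letter parsing are all assumptions made for the example.

```python
# Hypothetical sketch of option (b): keep only the atom name string at parse
# time and work out element and mass the first time they are requested.
ATOMIC_MASSES = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}


class LazyAtom:
    def __init__(self, fullname):
        self.fullname = fullname  # raw 4-character atom name field
        self._element = None      # filled in on first access
        self.parse_count = 0      # only here to show how often we parse

    def _parse_element(self):
        # Deliberately naive: first letter after padding and leading digits.
        self.parse_count += 1
        name = self.fullname.strip().lstrip("0123456789")
        return name[:1].upper()

    @property
    def element(self):
        if self._element is None:
            self._element = self._parse_element()
        return self._element

    @property
    def mass(self):
        # Returns None for elements missing from the small table above.
        return ATOMIC_MASSES.get(self.element)
```

Constructing a LazyAtom does no parsing at all; the first read of .element
or .mass triggers it, and later reads reuse the stored value. The cost is
one extra None check per access, which is the small slowdown mentioned
above when every atom's element is needed.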