[Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?

Tue Sep 25 12:26:48 UTC 2012

On Tue, Sep 25, 2012 at 12:26 PM, Martin Mokrejs
<mmokrejs at fold.natur.cuni.cz> wrote:
> Peter Cock wrote:
>
>> Currently the HSP object in SearchIO uses hit_start,
>> hit_end, query_start and query_end - but also note
>> that we're using Python counting.
>
> Ah, thanks for the reminder. Yes, this is exactly why I wasn't keen to re-implement
> my code on top of SearchIO right now, though I forgot to say so. I have already fixed all the
> off-by-one adjustments in my code, using zero-based counting in some places and
> 1-based counting in others (where a human reads the output text files/tables). These are
> scattered through the program (I think), and this will probably be the major blocker for me.
> ;) Things might break for me all over the place.

Sadly, whenever we are dealing with position input/output,
off-by-one adjustments will be required. I think it is wise to use
just one standard internally within a tool, and for Python that means
zero-based counting.
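
As a rough illustration of keeping one convention internally, the
conversion can be confined to two small helpers at the input/output
boundary. This is only a sketch: the function names are hypothetical
(not part of Biopython), and it assumes forward-strand coordinates
where start <= end.

```python
def to_zero_based(start, end):
    """Convert a 1-based inclusive range to a 0-based half-open range.

    Hypothetical helper; assumes 1 <= start <= end (forward strand).
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Only the start shifts; the inclusive end equals the exclusive end.
    return start - 1, end


def to_one_based(start, end):
    """Convert a 0-based half-open range back to 1-based inclusive.

    Hypothetical helper; assumes 0 <= start < end.
    """
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


seq = "ACGTACGTAC"
# A human-facing report says "positions 3 to 6"; convert once on input.
start, end = to_zero_based(3, 6)
fragment = seq[start:end]  # Python slicing uses the 0-based, half-open form
```

Converting only when reading or writing human-facing text keeps all
internal arithmetic and slicing in Python counting.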

> I am not saying this is a good idea but really, providing cElementTree calls from within
> NCBIXML would be more appealing to me (instead of the current Python-based expat parser
> calls).

OK - so there is at least one person making heavy use of
NCBIXML, so we shouldn't rush to deprecate it after merging
SearchIO, and there *is* some benefit from making it faster
(but with the same API).

In principle NCBIXML could be rewritten to use cElementTree/
ElementTree while preserving the API - if you or anyone else wants
to do that (and the unit tests still pass), then I'm happy to review
such changes. Likewise for less dramatic optimisations.
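
To give a feel for what an ElementTree-based approach might look like,
here is a minimal sketch using iterparse from the standard library. It
is not the NCBIXML API and does not build its record objects: the
function name iter_hsps and the returned dict keys are hypothetical,
the sample document is a cut-down illustration, and only a handful of
BLAST XML fields are read. On current Python versions the C accelerated
parser is used automatically via xml.etree.ElementTree, so there is no
separate cElementTree import.

```python
import io
import xml.etree.ElementTree as ET

# Cut-down, made-up BLAST XML document used only for illustration.
SAMPLE = b"""<?xml version="1.0"?>
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_id>gi|1|example</Hit_id>
          <Hit_hsps>
            <Hsp>
              <Hsp_query-from>1</Hsp_query-from>
              <Hsp_query-to>50</Hsp_query-to>
              <Hsp_hit-from>101</Hsp_hit-from>
              <Hsp_hit-to>150</Hsp_hit-to>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def iter_hsps(handle):
    """Yield one dict per HSP, with coordinates as reported (1-based).

    Hypothetical helper, not part of Biopython.
    """
    # "end" events fire once an element and all its children are parsed.
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag != "Hit":
            continue
        hit_id = elem.findtext("Hit_id")
        for hsp in elem.iter("Hsp"):
            # Collect the HSP's child elements into a tag -> text mapping.
            fields = {child.tag: child.text for child in hsp}
            yield {
                "hit_id": hit_id,
                "query_from": int(fields["Hsp_query-from"]),
                "query_to": int(fields["Hsp_query-to"]),
                "hit_from": int(fields["Hsp_hit-from"]),
                "hit_to": int(fields["Hsp_hit-to"]),
            }
        # Drop the finished Hit's children to keep memory use bounded.
        elem.clear()


rows = list(iter_hsps(io.BytesIO(SAMPLE)))
```

A real replacement would still need to populate the existing NCBIXML
record classes so that the current unit tests pass unchanged.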

Regards,

Peter