[Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?

Peter Cock p.j.a.cock at googlemail.com
Tue Sep 25 08:26:48 EDT 2012


On Tue, Sep 25, 2012 at 12:26 PM, Martin Mokrejs
<mmokrejs at fold.natur.cuni.cz> wrote:
> Peter Cock wrote:
>
>> Currently the HSP object in SearchIO uses hit_start,
>> hit_end, query_start and query_end - but also note
>> that we're using Python counting.
>
> Ah, thanks for the reminder. Yes, this is exactly why I wasn't very happy to re-implement
> my code right now to use SearchIO, but I forgot to say so. I have already fixed all the
> off-by-one adjustments in my code, using zero-based counting in some places and
> 1-based counting in others (where a human reads the output text files/tables). These are
> scattered throughout the program (I think), and this will probably be the major blocker for me.
> ;) Things might break for me all over the place.

Sadly, whenever we are dealing with position input/output there
will be off-by-one adjustments required. I think it is wise to use
just one standard internally within a tool, and for Python that
means zero-based counting.
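
As a rough illustration of keeping one convention internally and
converting only at the input/output boundary, here is a minimal sketch.
It assumes the external coordinates are 1-based and inclusive (as in
BLAST reports) and that the internal convention is a Python-style
zero-based, half-open slice. The function names are hypothetical and
not part of Biopython.

```python
def to_python_span(start_1based, end_1based):
    """Convert a 1-based inclusive range to a 0-based half-open (start, end)."""
    # Only the start shifts; the inclusive end equals the exclusive slice end.
    return start_1based - 1, end_1based


def to_human_span(start_0based, end_0based):
    """Convert a 0-based half-open (start, end) back to 1-based inclusive."""
    return start_0based + 1, end_0based


seq = "ACGTACGT"

# Positions 3..5 as a human would count them (G, T, A).
start, end = to_python_span(3, 5)
fragment = seq[start:end]

# Converting back at output time recovers the original numbers.
round_trip = to_human_span(start, end)
```

With the conversion confined to two small helpers, the rest of the
program can use slices directly without further adjustments.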

> I am not saying this is a good idea, but really, providing cElementTree calls from within
> NCBIXML would be more appealing to me (instead of the current Python-based expat parser
> calls).

OK - so there is at least one person making heavy use of
NCBIXML, so we shouldn't rush to deprecate it after merging
SearchIO, and there *is* some benefit from making it faster
(but with the same API).

In principle NCBIXML could be rewritten to use cElementTree /
ElementTree while preserving the API - if you or anyone else wants
to do that (and the unit tests still pass), then I'm happy to review
such changes. Likewise for less dramatic optimisations.
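
To give a feel for what an ElementTree-based approach might look like,
here is a minimal sketch using iterparse from the standard library. It
is not the NCBIXML implementation and does not reproduce its record
objects; the iter_hsps function and the dictionary keys are
hypothetical, and the sample input is a hand-made fragment containing
only a few of the element names found in BLAST XML output.

```python
import io
import xml.etree.ElementTree as ET

# Tiny hand-written fragment for illustration, not real BLAST output.
SAMPLE = b"""<BlastOutput><BlastOutput_iterations><Iteration>
<Iteration_query-def>q1</Iteration_query-def>
<Iteration_hits><Hit><Hit_id>h1</Hit_id><Hit_hsps><Hsp>
<Hsp_evalue>1e-5</Hsp_evalue>
<Hsp_query-from>1</Hsp_query-from><Hsp_query-to>10</Hsp_query-to>
<Hsp_hit-from>5</Hsp_hit-from><Hsp_hit-to>14</Hsp_hit-to>
</Hsp></Hit_hsps></Hit></Iteration_hits>
</Iteration></BlastOutput_iterations></BlastOutput>"""


def iter_hsps(handle):
    """Yield one dict per HSP, processing one <Iteration> at a time."""
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag != "Iteration":
            continue
        query = elem.findtext("Iteration_query-def")
        for hit in elem.iter("Hit"):
            hit_id = hit.findtext("Hit_id")
            for hsp in hit.iter("Hsp"):
                yield {
                    "query": query,
                    "hit_id": hit_id,
                    "evalue": float(hsp.findtext("Hsp_evalue")),
                    # Coordinates are kept 1-based, as written in the file.
                    "query_from": int(hsp.findtext("Hsp_query-from")),
                    "query_to": int(hsp.findtext("Hsp_query-to")),
                    "hit_from": int(hsp.findtext("Hsp_hit-from")),
                    "hit_to": int(hsp.findtext("Hsp_hit-to")),
                }
        # Drop the finished iteration's children to limit memory use.
        elem.clear()


hsps = list(iter_hsps(io.BytesIO(SAMPLE)))
```

Handling each Iteration element as soon as it closes, then clearing
it, is one way to avoid holding a large result file in memory. Whether
this is actually faster than the current parser would need to be
measured.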

Regards,

Peter

