[Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?

Tue Sep 25 09:53:08 UTC 2012

Hi Martin, Peter,

I agree with Peter. The new object model differs from the old one in
Bio.Blast, so a simple search & replace might not do the trick. The
same goes for the attribute names. I suppose I could add a table to
the draft tutorial listing the new attribute names, but I prefer not
to have any Bio.Blast-compatible names in the code.
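
For illustration, such a table could look roughly like the sketch
below. Only sbjct_start -> hit_start comes up in this thread; the
other entries and the helper name are assumptions on my part, so check
them against the draft tutorial before relying on them.

```python
# Illustrative old -> new HSP attribute names (Bio.Blast -> SearchIO).
# Only the first pair is mentioned in this thread; the rest are
# assumptions and may differ on the branch.
OLD_TO_NEW = {
    "sbjct_start": "hit_start",  # mentioned in this thread
    "sbjct_end": "hit_end",      # assumed by analogy
    "expect": "evalue",          # assumed
    "bits": "bitscore",          # assumed
}


def translate(old_name):
    """Return the new-style name, or the old one if it has no entry."""
    return OLD_TO_NEW.get(old_name, old_name)


for name in ("sbjct_start", "expect", "query_start"):
    print("%-12s -> %s" % (name, translate(name)))
```

A table like this only covers names. It says nothing about the changed
object hierarchy, which is why search & replace alone is not enough.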

As for the profiling, I did some quick benchmarks, but they weren't
really thorough. I only compared the parsing times of Bio.Blast.NCBIXML
and the new BLAST XML parser in SearchIO. Using a test file containing
1000 BLAST queries (286 MB total), the results over five runs each were
as follows:

on SearchIO:
97.11
93.66
94.13
91.35
90.90
Total time : 467.15
Average    : 93.43

on Bio.Blast:
441.45
412.57
471.31
434.22
429.35
Total time : 2188.90
Average    : 437.78

The speed-up was about 4.7x (437.78 / 93.43). I didn't check for any
optimizable bottlenecks, though.
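
To make the arithmetic easy to check, here is a minimal sketch that
recomputes the averages and the ratio from the numbers listed above.
The variable and function names are only for illustration.

```python
# Run times copied from the two lists above, in the units reported.
searchio_runs = [97.11, 93.66, 94.13, 91.35, 90.90]
bioblast_runs = [441.45, 412.57, 471.31, 434.22, 429.35]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / float(len(values))


searchio_avg = mean(searchio_runs)
bioblast_avg = mean(bioblast_runs)
# Ratio of the old parser's average time to the new parser's.
speedup = bioblast_avg / searchio_avg

print("SearchIO average : %.2f" % searchio_avg)
print("Bio.Blast average: %.2f" % bioblast_avg)
print("Speed-up         : %.2fx" % speedup)
```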

Hope that helps,
Bow

On Tue, Sep 25, 2012 at 10:09 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Tue, Sep 25, 2012 at 2:28 AM, Martin Mokrejs
> <mmokrejs at fold.natur.cuni.cz> wrote:
>> Hi Wibowo,
>>   will you also add the cElementTree calls to NCBIXML (replacing SAX parser)?
>> I would have to look up how the record attributes changed(=renamed)
>> from those specific for blast to those generalized and used(=promoted)
>> by SearchIO. Do you have a list of sed regexps? ;-)
>>
>>   Or shall I just replace NCBIXML.parse() with SearchIO.parse()? ;-)
>> Both would be certainly helpful at 3 a.m. :(( I am leaving the profiler
>> running overnight and will at least be able to look up where the bottleneck
>> in the current NCBIXML is. The rest ... next time. ;-)
>
> We did discuss updating the internals of the old NCBIXML parser
> to use ElementTree / cElementTree, but currently the plan is to
> simply deprecate the old parser, so this seems a wasted effort.
>
>>   I see, hsp.sbjct_start is renamed to hsp.hit_start ... How about exposing
>> *additionally* the data through "old" names? So that "SearchIO" would expose
>> both hsp.hit_start and also hsp.sbjct_start ... and maybe even more if blat
>> or hmmer parsers expose so far other attributes as well. I know it is ugly but
>> makes the transition smoother. ;-) Those are just references. ;-) I would
>> like to do something like:
>>
>> try:
>>     # latest and greatest biopython version installed
>>     from Bio import SearchIO
>> except ImportError:
>>     # some old installation
>>     from Bio.Blast import NCBIXML
>>
>>
>> but have the rest of my code unchanged. Umm, I use NCBIXML.parse()
>> so I won't need even the above. You just change it in your git branch
>> and I won't have to touch my code. That's fair, isn't it? ;-)
>
> The plan is to reward people for updating their code by giving them
> faster BLAST XML parsing (and an easy way to try out other input
> file formats in future).
>
> Note that Bio.SearchIO is the working name and current namespace
> used on the branch, but is unlikely to be the final name.
>
> And I'm not keen on adding backwards compatible aliases for the old
> BLAST parser names - even if they did come with deprecation warnings.
> In fact I suspect even that wouldn't give you the drop-in replacement
> you are hoping for, since the object hierarchy has changed too.
>
> However, if there are some specific cases where you think the old
> name is still sensible given the broader scope of the new parser
> covering many other formats as well as BLAST, then some minor
> renames seem more reasonable.
>
>>   Did you profile biopython or SearchIO yourself?
>> Best,
>> Martin
>
> Bow did some profiling of the old NCBIXML parser against his
> SearchIO work.
>
> Peter