[Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?

Martin Mokrejs mmokrejs at fold.natur.cuni.cz
Tue Sep 25 11:15:19 UTC 2012


Peter Cock wrote:
> On Tue, Sep 25, 2012 at 2:28 AM, Martin Mokrejs
> <mmokrejs at fold.natur.cuni.cz> wrote:
>> Hi Wibowo,
>>   will you also add the cElementTree calls to NCBIXML (replacing SAX parser)?
>> I would have to look up how the record attributes changed(=renamed)
>> from those specific for blast to those generalized and used(=promoted)
>> by SearchIO. Do you have a list of sed regexps? ;-)
>>
>>   Or shall I just replace NCBIXML.parse() with SearchIO.parse()? ;-)
>> Both would be certainly helpful at 3 a.m. :(( I am leaving the profiler
>> running overnight and will at least be able to look up where the bottleneck
>> in the current NCBIXML is. The rest ... next time. ;-)
> 
> We did discuss updating the internals of the old NCBIXML parser
> to use ElementTree / cElementTree, but currently the plan is to
> simply deprecate the old parser, so this seems a wasted effort.

Then that means some parts of my code will exist twice. As you said below,
the object structure in SearchIO differs from that in NCBIXML, so I will really
need two routines. :( One for newer installations and one for (most) older
Biopython versions.

I would really suggest spending some effort on optimizing the code of the
old parser. The gain might be quite substantial, easy for you to achieve,
and at no cost to end users.
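
For comparison, this is roughly what an ElementTree-based pass over BLAST XML
looks like. It is only a sketch, not the NCBIXML or SearchIO implementation:
the function name iter_hsps and the tiny sample document are made up for
illustration, while the element names (Hit_def, Hsp, Hsp_hit-from, Hsp_hit-to,
Hsp_evalue) are the ones found in BLAST XML output.

```python
import io
import xml.etree.ElementTree as ET

# Made-up, heavily trimmed BLAST XML document, for illustration only.
SAMPLE = b"""<?xml version="1.0"?>
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_def>example subject</Hit_def>
          <Hit_hsps>
            <Hsp>
              <Hsp_evalue>1e-20</Hsp_evalue>
              <Hsp_hit-from>11</Hsp_hit-from>
              <Hsp_hit-to>60</Hsp_hit-to>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def iter_hsps(handle):
    """Yield (hit_def, hit_from, hit_to, evalue) for each HSP in the stream."""
    hit_def = None
    # "end" events fire once an element and all of its children are complete.
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "Hit_def":
            hit_def = elem.text
        elif elem.tag == "Hsp":
            yield (
                hit_def,
                int(elem.findtext("Hsp_hit-from")),
                int(elem.findtext("Hsp_hit-to")),
                float(elem.findtext("Hsp_evalue")),
            )
        elif elem.tag == "Iteration":
            # Drop the finished iteration's subtree to keep memory use low.
            elem.clear()


rows = list(iter_hsps(io.BytesIO(SAMPLE)))
```

Clearing each finished Iteration is what keeps memory use low on large
result files; whether this beats the SAX-based code in practice would have
to be measured with the profiler.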

> 
>>   I see, hsp.sbjct_start is renamed to hsp.hit_start ... How about exposing
>> *additionally* the data through "old" names? So that "SearchIO" would expose
>> both hsp.hit_start and also hsp.sbjct_start ... and maybe even more if blat
>> or hmmer parsers expose so far other attributes as well. I know it is ugly but
>> makes the transition smoother. ;-) Those are just references. ;-) I would
>> like to do something like:
>>
>> try:
>>     # latest and greatest biopython version installed
>>     from Bio import SearchIO
>> except ImportError:
>>     # some old installation
>>     from Bio.Blast import NCBIXML
>>
>>
>> but have the rest of my code unchanged. Umm, I use NCBIXML.parse()
>> so I won't need even the above. You just change it in your git branch
>> and I won't have to touch my code. That's fair, isn't it? ;-)
> 
> The plan is to reward people for updating their code by giving them
> faster BLAST XML parsing (and an easy way to try out other input
> file formats in future).

It will take a long while for people to switch over. Fixing all the HOWTOs
and other docs on websites all over the world ... that's a long shot.
I would really try to provide a mapping interface so that people can just
do the above try/except trick during module import.
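
To illustrate the kind of mapping interface meant here, a minimal sketch
follows. Everything in it is hypothetical: the LegacyNames wrapper and the
OLD_TO_NEW table are invented for this example, and only the
sbjct_start -> hit_start pair is actually mentioned in this thread (the
sbjct_end -> hit_end pair is assumed by analogy).

```python
from types import SimpleNamespace

# Hypothetical old-name -> new-name table. Only the first pair comes from
# this thread; the second is an assumption made by analogy.
OLD_TO_NEW = {
    "sbjct_start": "hit_start",
    "sbjct_end": "hit_end",
}


class LegacyNames:
    """Wrap a new-style object so that old attribute names still resolve."""

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def __getattr__(self, name):
        # __getattr__ runs only when normal lookup fails, so _wrapped is
        # normally found directly; the guard avoids infinite recursion if
        # it is ever missing.
        if name == "_wrapped":
            raise AttributeError(name)
        return getattr(self._wrapped, OLD_TO_NEW.get(name, name))


# Stand-in for a new-style HSP object; a real one would come from the parser.
new_style_hsp = SimpleNamespace(hit_start=10, hit_end=60)
hsp = LegacyNames(new_style_hsp)
old_name_value = hsp.sbjct_start   # resolved through the alias table
new_name_value = hsp.hit_start     # passed straight through
```

Such a wrapper only covers renamed attributes. It does not help where the
object hierarchy itself differs between the two parsers, and it assumes the
values under the old and new names mean the same thing.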

> 
> Note that Bio.SearchIO is the working name and current namespace
> used on the branch, but is unlikely to be the final name.

That's no problem for me.

> 
> And I'm not keen on adding backwards compatible aliases for the old
> BLAST parser names - even if they did come with deprecation warnings.
> In fact I suspect even that wouldn't give you the drop in replacement
> you are hoping for; the object hierarchy has changed too.

I understand your reasoning, but maintaining two copies of functionally
identical code is tedious for users as well. ;-) I can adjust for that myself,
sure.

> 
> However, if there are some specific cases where you think the old
> name is still sensible given the broader scope of the new parser
> covering many other formats as well as BLAST, then some minor
> renames seem more reasonable.
> 
>>   Did you profile biopython or SearchIO yourself?
>> Best,
>> Martin
> 
> Bow did some profiling of the old NCBIXML parser against his
> SearchIO work.
> 
> Peter
> 
> 


