[Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
Peter Cock
p.j.a.cock at googlemail.com
Tue Sep 25 08:09:26 UTC 2012
On Tue, Sep 25, 2012 at 2:28 AM, Martin Mokrejs
<mmokrejs at fold.natur.cuni.cz> wrote:
> Hi Wibowo,
> will you also add the cElementTree calls to NCBIXML (replacing SAX parser)?
> I would have to look up how the record attributes changed (= renamed)
> from those specific for blast to those generalized and used (= promoted)
> by SearchIO. Do you have a list of sed regexps? ;-)
>
> Or shall I just replace NCBIXML.parse() with SearchIO.parse()? ;-)
> Both would certainly be helpful at 3 a.m. :(( I am leaving the profiler
> running overnight and will at least be able to look up where the bottleneck
> in the current NCBIXML is. The rest ... next time. ;-)
We did discuss updating the internals of the old NCBIXML parser
to use ElementTree / cElementTree, but the current plan is simply
to deprecate the old parser, so that seems like wasted effort.
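For anyone curious what the ElementTree approach looks like, here is a
minimal sketch using only the standard library. It is not the code on
the SearchIO branch. The sample XML is made up, and the element names
(Iteration, Hit, Hsp, Hsp_hit-from, ...) are assumed to follow the legacy
BLAST XML output. The helper name iter_hsps is hypothetical.

```python
import io
import xml.etree.ElementTree as ET  # on Python 2, cElementTree was the fast variant

# Tiny made-up sample; element names assumed to match legacy BLAST XML.
SAMPLE = """<?xml version="1.0"?>
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>query1</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_id>gi|1</Hit_id>
          <Hit_hsps>
            <Hsp>
              <Hsp_hit-from>10</Hsp_hit-from>
              <Hsp_hit-to>50</Hsp_hit-to>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def iter_hsps(handle):
    """Yield (query, hit_id, hit_from, hit_to) one Iteration at a time."""
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag != "Iteration":
            continue
        query = elem.findtext("Iteration_query-def")
        for hit in elem.iter("Hit"):
            hit_id = hit.findtext("Hit_id")
            for hsp in hit.iter("Hsp"):
                yield (
                    query,
                    hit_id,
                    int(hsp.findtext("Hsp_hit-from")),
                    int(hsp.findtext("Hsp_hit-to")),
                )
        # Free the finished Iteration subtree so memory stays flat on big files.
        elem.clear()


rows = list(iter_hsps(io.BytesIO(SAMPLE.encode("utf-8"))))
```

Clearing each Iteration once it has been consumed is what keeps this
usable on large output files. Without it the whole tree is still built
in memory.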
> I see, hsp.sbjct_start is renamed to hsp.hit_start ... How about exposing
> *additionally* the data through "old" names? So that "SearchIO" would expose
> both hsp.hit_start and also hsp.sbjct_start ... and maybe even more if blat
> or hmmer parsers expose so far other attributes as well. I know it is ugly but
> makes the transition smoother. ;-) Those are just references. ;-) I would
> like to do something like:
>
> try:
>     # latest and greatest biopython version installed
>     from Bio import SearchIO
> except ImportError:
>     # some old installation
>     from Bio import SeqIO
>
>
> but have the rest of my code unchanged. Umm, I use NCBIXML.parse()
> so I won't need even the above. You just change it in your git branch
> and I won't have to touch my code. That's fair, isn't it? ;-)
The plan is to reward people for updating their code by giving them
faster BLAST XML parsing (and an easy way to try out other input
file formats in future).
Note that Bio.SearchIO is the working name and current namespace
used on the branch, but is unlikely to be the final name.
And I'm not keen on adding backwards compatible aliases for the old
BLAST parser names - even if they did come with deprecation warnings.
In fact I suspect even that wouldn't give you the drop-in replacement
you are hoping for, since the object hierarchy has changed too.
However, if there are some specific cases where you think the old
name is still sensible given the broader scope of the new parser
covering many other formats as well as BLAST, then some minor
renames seem more reasonable.
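If you really want the old names during a transition, one option is a
small shim in your own code rather than in Biopython. The following is
only a sketch: LegacyNameProxy and FakeHsp are hypothetical names, the
mapping contains just the one rename mentioned in this thread, and it
does nothing about the changed object hierarchy.

```python
import warnings

# Old attribute name -> new attribute name. Only the rename mentioned in
# this thread is listed; any others would have to be added by hand.
OLD_TO_NEW = {"sbjct_start": "hit_start"}


class LegacyNameProxy(object):
    """Hypothetical user-side wrapper that forwards old names to new ones."""

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def __getattr__(self, name):
        # Only called when normal lookup fails, so _wrapped itself is safe.
        new_name = OLD_TO_NEW.get(name, name)
        if new_name != name:
            warnings.warn(
                "%s is an old name, use %s instead" % (name, new_name),
                DeprecationWarning,
                stacklevel=2,
            )
        return getattr(self._wrapped, new_name)


class FakeHsp(object):
    """Stand-in for a new-style HSP object, for illustration only."""

    hit_start = 10


hsp = LegacyNameProxy(FakeHsp())
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_value = hsp.sbjct_start  # forwarded to hit_start, with a warning
new_value = hsp.hit_start        # new name passes straight through
```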
> Did you profile biopython or SearchIO yourself?
> Best,
> Martin
Bow did some profiling of the old NCBIXML parser against his
SearchIO work.
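If you want to compare the two yourself, the standard library's cProfile
is enough to find the hot spots. This is a generic sketch, not the
script Bow used. profile_call and parse_all are hypothetical names, and
parse_all is a stand-in workload you would replace with a loop over
your own parser's records.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return (result, text report)."""
    top = kwargs.pop("top", 10)  # how many rows of the report to keep
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


def parse_all(n):
    """Stand-in workload; swap in a loop that consumes your parser."""
    return sum(1 for _ in range(n))


count, report = profile_call(parse_all, 1000)
```

Sorting by cumulative time makes it easier to see which high-level call
dominates, rather than which low-level function is called most often.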
Peter