[Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
Martin Mokrejs
mmokrejs at fold.natur.cuni.cz
Tue Sep 25 01:28:36 UTC 2012
Hi Wibowo,
will you also add the cElementTree calls to NCBIXML (replacing SAX parser)?
I would have to look up how the record attributes changed (= were renamed)
from those specific to blast to those generalized and used (= promoted)
by SearchIO. Do you have a list of sed regexps? ;-)
Or shall I just replace NCBIXML.parse() with SearchIO.parse()? ;-)
Both would certainly be helpful at 3 a.m. :(( I am leaving the profiler
running overnight and will at least be able to look up where the bottleneck
in the current NCBIXML is. The rest ... next time. ;-)
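For reference, a minimal sketch of how such a profiling run could be set up with the standard library's cProfile and pstats. The helper name profile_consumption and its parameters are made up for illustration, and the commented-out usage assumes a hypothetical file name; only NCBIXML.parse() comes from this thread.

```python
import cProfile
import io
import pstats


def profile_consumption(make_iterator, limit=None, top=10):
    """Profile the consumption of an iterator.

    make_iterator: zero-argument callable returning the iterator to drain
                   (hypothetical helper signature, not a Biopython API).
    limit:         stop after this many items, or None to drain everything.
    top:           number of rows to keep in the report.
    Returns (items consumed, text report sorted by cumulative time).
    """
    profiler = cProfile.Profile()
    consumed = 0
    profiler.enable()
    try:
        for _ in make_iterator():
            consumed += 1
            if limit is not None and consumed >= limit:
                break
    finally:
        profiler.disable()

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return consumed, out.getvalue()


# Hypothetical usage (needs Biopython and a real BLAST XML file):
# from Bio.Blast import NCBIXML
# n, report = profile_consumption(
#     lambda: NCBIXML.parse(open("blast_out.xml")), limit=1000)
# print(report)
```

Limiting the run to the first few thousand records is usually enough to see which functions dominate, without waiting for a whole 100GB file.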
I see, hsp.sbjct_start is renamed to hsp.hit_start ... How about exposing
*additionally* the data through "old" names? So that "SearchIO" would expose
both hsp.hit_start and also hsp.sbjct_start ... and maybe even more if blat
or hmmer parsers expose so far other attributes as well. I know it is ugly but
makes the transition smoother. ;-) Those are just references. ;-) I would
like to do something like:
try:
    # latest and greatest biopython version installed
    from Bio import SearchIO
except ImportError:
    # some old installation
    from Bio import SeqIO
but have the rest of my code unchanged. Umm, I use NCBIXML.parse() so I won't even
need the above. You just change it in your git branch and I won't have to touch my
code. That's fair, isn't it? ;-)
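To make the "those are just references" idea concrete, here is a rough sketch of a wrapper that resolves old attribute names to new ones. The class name LegacyNames and the alias table are hypothetical and not part of Biopython or SearchIO; the only rename taken from this thread is sbjct_start -> hit_start.

```python
class LegacyNames(object):
    """Wrap an object and resolve old attribute names to their new ones.

    Hypothetical helper for illustration only, not a Biopython class.
    """

    # old name -> new name; only this pair is taken from the thread,
    # anything else would have to be added by hand.
    _ALIASES = {
        "sbjct_start": "hit_start",
    }

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def __getattr__(self, name):
        # __getattr__ runs only when normal lookup on the wrapper fails.
        if name == "_wrapped":
            # Avoid infinite recursion if the wrapper is not initialised yet.
            raise AttributeError(name)
        return getattr(self._wrapped, self._ALIASES.get(name, name))


class FakeHsp(object):
    """Stand-in for an HSP object that only knows the new attribute name."""
    hit_start = 42


hsp = LegacyNames(FakeHsp())
# Both the old and the new name now resolve to the same value.
assert hsp.sbjct_start == hsp.hit_start
```

Because the wrapper only forwards lookups, no data is copied, and names missing from the alias table fall through to the wrapped object unchanged.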
Did you profile biopython or SearchIO yourself?
Best,
Martin
Wibowo Arindrarto wrote:
> Hi Martin,
>
> There is actually already a faster BLAST XML parser written using
> cElementTree in Biopython :) (although it's yet to be included in the
> main branch). It's part of Biopython's SearchIO module that I recently
> wrote (the name SearchIO might change in the future). And indeed, my
> early benchmarks have shown that it does perform faster.
>
> This branch is available here:
> https://github.com/bow/biopython/tree/searchio. I've also written a
> draft tutorial on how to use it here:
> http://bow.web.id/biopython/Tutorial.html#htoc96.
>
> However, as it's not yet in the current branch, you need to do a
> little bit of command line work to set it up:
>
> 1. Set up a new virtualenv environment (so that it doesn't clash with
> your other Biopython installation) and activate it.
> 2. Clone the repository: `git clone
> https://github.com/bow/biopython.git`, then check out the 'searchio' branch.
> 3. Run `python setup.py develop`. This will keep the
> installation in-sync with any future `git pull` you might perform on
> the branch.
>
> Hope this helps :),
> Bow
>
>
> On Thu, Sep 13, 2012 at 5:20 PM, Martin Mokrejs
> <mmokrejs at fold.natur.cuni.cz> wrote:
>> Hi,
>> I am using "blastall -p blastn ... -m 7" to yield about 100GB large XML files
>> which are then parsed by
>>
>> from Bio.Blast import NCBIXML
>> _blastn_fileh = open(blast_out_xml_filename)
>> _blastn_iterator = NCBIXML.parse(_blastn_fileh)
>> _record = next(_blastn_iterator) # fetch the very first BLAST result from the generator
>>
>> In my case the blastn searches seem to take longer than the XML parsing does. :(
>> I do not have timing numbers here but wonder why cElementTree is used only in the UniProt
>> Biopython modules and not in SeqIO. Which XML parsing library is my biopython-1.59 using?
>> Isn't there an argument to setup.py to choose between elementtree and cElementTree,
>> which I think use expat ...? I am writing this a bit off the top of my head, hoping Peter ;-)
>> or somebody else will know right away where to look for a performance bottleneck
>> and where to change the code to use cElementTree, which always seemed the fastest to me.
>> Thank you for some initial advice.
>> Martin
>> P.S.: And yes, I would love to parse blastn plaintext output or some other more compact format,
>> the XML is really overkill.
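As an illustration of the cElementTree approach discussed in this thread, the following is a rough sketch of streaming a BLAST XML file with iterparse instead of building the whole tree. It is not the NCBIXML or SearchIO implementation. The function name iter_query_hit_counts is made up; the tag names (Iteration, Iteration_query-def, Iteration_hits, Hit) are assumed to follow the usual NCBI BLAST XML layout produced by "-m 7".

```python
import xml.etree.ElementTree as ET  # cElementTree on older Python 2 installs


def iter_query_hit_counts(handle):
    """Yield (query definition, number of hits) per <Iteration> element.

    handle: a file name or a binary file-like object with BLAST XML.
    Hypothetical helper for illustration; tag names are assumptions
    based on the common BLAST XML layout.
    """
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "Iteration":
            query_def = elem.findtext("Iteration_query-def")
            n_hits = len(elem.findall("Iteration_hits/Hit"))
            yield query_def, n_hits
            # Drop the children of this Iteration to keep memory low.
            # The emptied element itself stays attached to its parent,
            # so very large files may need extra cleanup of the parent.
            elem.clear()
```

The point of the design is that each record is handled as soon as its closing tag is seen, so memory use depends on the size of one record rather than on the size of the file.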