[Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?

Tue Sep 25 01:28:36 UTC 2012

Hi Wibowo,
  will you also add the cElementTree calls to NCBIXML (replacing the SAX parser)?
I would have to look up how the record attributes were renamed from the
BLAST-specific ones to the generalized ones that SearchIO promotes.
Do you have a list of sed regexps? ;-)
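
For reference, a minimal sketch of what streaming BLAST XML with ElementTree's
iterparse() could look like. This is not the SearchIO or NCBIXML code; the
function name iter_hsp_coords and the tiny SAMPLE document are made up for
illustration. On current Python versions xml.etree.ElementTree picks up the C
accelerator by itself, so it stands in for cElementTree here.

```python
import io
import xml.etree.ElementTree as ET  # C-accelerated automatically on Python 3.3+


def iter_hsp_coords(handle):
    """Yield (query_def, hit_id, hit_from, hit_to) per HSP, one Iteration at a time."""
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag != "Iteration":
            continue
        query_def = elem.findtext("Iteration_query-def")
        for hit in elem.iterfind("Iteration_hits/Hit"):
            hit_id = hit.findtext("Hit_id")
            for hsp in hit.iterfind("Hit_hsps/Hsp"):
                yield (
                    query_def,
                    hit_id,
                    int(hsp.findtext("Hsp_hit-from")),
                    int(hsp.findtext("Hsp_hit-to")),
                )
        # Discard the children of the finished Iteration to keep memory use low.
        elem.clear()


# Hypothetical, heavily trimmed BLAST XML used only to exercise the function.
SAMPLE = b"""<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>query1</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_id>gi|1|</Hit_id>
          <Hit_hsps>
            <Hsp>
              <Hsp_hit-from>10</Hsp_hit-from>
              <Hsp_hit-to>50</Hsp_hit-to>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""

rows = list(iter_hsp_coords(io.BytesIO(SAMPLE)))
```

The point of the sketch is only that each Iteration is handled and released as
soon as its end tag is seen, instead of building the whole tree in memory.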

  Or shall I just replace NCBIXML.parse() with SearchIO.parse()? ;-)
Both would certainly be helpful at 3 a.m. :(( I am leaving the profiler
running overnight and will at least be able to look up where the bottleneck
in the current NCBIXML is. The rest ... next time. ;-)
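
A rough sketch of the kind of profiling run meant here, using the standard
cProfile and pstats modules. The helper profile_iteration and the generator
fake_records are hypothetical; fake_records only stands in for the iterator
returned by NCBIXML.parse(handle), so the sketch runs without a BLAST file.

```python
import cProfile
import io
import pstats


def profile_iteration(make_iterator, limit=1000, top=10):
    """Consume up to `limit` items from make_iterator() under cProfile.

    Returns the text report, sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        for count, _item in enumerate(make_iterator(), start=1):
            if count >= limit:
                break
    finally:
        profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return out.getvalue()


def fake_records():
    # Hypothetical stand-in for NCBIXML.parse(handle); swap in the real iterator.
    for i in range(5000):
        yield sum(range(i % 100))


report = profile_iteration(fake_records, limit=500)
print(report)
```

Capping the run with `limit` gives a usable report after a few hundred
records, which may be enough to spot the hot functions without waiting for
the whole file.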

  I see, hsp.sbjct_start is renamed to hsp.hit_start ... How about exposing
*additionally* the data through "old" names? So that "SearchIO" would expose
both hsp.hit_start and also hsp.sbjct_start ... and maybe even more if the blat
or hmmer parsers expose other attributes so far as well. I know it is ugly but
it makes the transition smoother. ;-) Those are just references. ;-) I would
like to do something like:

try:
    # latest and greatest biopython version installed
    from Bio import SearchIO
except ImportError:
    # some old installation
    from Bio.Blast import NCBIXML

but have the rest of my code unchanged. Umm, I use NCBIXML.parse() so I won't even
need the above. You just change it in your git branch and I won't have to touch my
code. That's fair, isn't it? ;-)
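
The "old names as references" idea could be sketched roughly as below. The
HSP class here is hypothetical and is not SearchIO's actual class; it only
shows old attribute names forwarding to the new ones through properties.

```python
class HSP(object):
    """Hypothetical HSP container (an assumption, not the SearchIO class)."""

    def __init__(self, hit_start, hit_end):
        self.hit_start = hit_start
        self.hit_end = hit_end

    # Old NCBIXML-style names simply forward to the new attributes.
    @property
    def sbjct_start(self):
        return self.hit_start

    @sbjct_start.setter
    def sbjct_start(self, value):
        self.hit_start = value

    @property
    def sbjct_end(self):
        return self.hit_end

    @sbjct_end.setter
    def sbjct_end(self, value):
        self.hit_end = value


hsp = HSP(hit_start=10, hit_end=50)
old_style = (hsp.sbjct_start, hsp.sbjct_end)  # read through the old names
hsp.sbjct_start = 12                          # write through an old name
```

A real alias might also have to convert values if the two APIs do not use
the same coordinate convention, so plain forwarding is only the simplest case.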

  Did you profile biopython or SearchIO yourself?
Best,
Martin

Wibowo Arindrarto wrote:
> Hi Martin,
> 
> There is actually already a faster BLAST XML parser written using
> cElementTree in Biopython :) (although it's yet to be included in the
> main branch). It's part of Biopython's SearchIO module that I recently
> wrote (the name SearchIO might change in the future). And indeed, my
> early benchmarks have shown that it does perform faster.
> 
> This branch is available here:
> https://github.com/bow/biopython/tree/searchio. I've also written a
> draft tutorial on how to use it here:
> http://bow.web.id/biopython/Tutorial.html#htoc96.
> 
> However, as it's not yet in the current branch, you need to do a
> little bit of command line work to set it up:
> 
> 1. Set up a new virtualenv environment (so that it doesn't clash with
> your other Biopython installation) and activate it.
> 2. Clone the repository: `git clone
> https://github.com/bow/biopython.git`, and check out the 'searchio' branch.
> 3. Run `python setup.py develop`. This will keep the
> installation in-sync with any future `git pull` you might perform on
> the branch.
> 
> Hope this helps :),
> Bow
> 
> 
> On Thu, Sep 13, 2012 at 5:20 PM, Martin Mokrejs
> <mmokrejs at fold.natur.cuni.cz> wrote:
>> Hi,
>>   I am using "blastall -p blastn ... -m 7" to produce XML files of about 100 GB
>> which are then parsed by
>>
>>     from Bio.Blast import NCBIXML
>>     _blastn_fileh = open(blast_out_xml_filename)
>>     _blastn_iterator = NCBIXML.parse(_blastn_fileh)
>>     _record = next(_blastn_iterator)  # fetch the very first BLAST result from the generator
>>
>>   In my case the XML parsing seems to take longer than the blastn searches themselves. :(
>> I do not have timing numbers here, but I wonder why cElementTree is used only in the UniProt
>> Biopython modules and not in SeqIO. What XML parsing library is my biopython-1.59 using?
>> Isn't there an argument to setup.py to choose between elementtree and cElementTree,
>> which I think use expat ...? I am writing this a bit off the top of my head, hoping Peter ;-)
>> or somebody else will know right away where to look for a performance bottleneck
>> and where to change the code to use cElementTree, which always seemed the fastest to me.
>> Thank you for some initial advice.
>> Martin
>> P.S.: And yes, I would love to parse blastn plaintext output or some other more compact one,
>> the XML is really overkill.
>> _______________________________________________
>> Biopython mailing list  -  Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython