[Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?

Wibowo Arindrarto w.arindrarto at gmail.com
Thu Sep 13 15:40:41 UTC 2012


Hi Martin,

There is actually already a faster BLAST XML parser written using
cElementTree in Biopython :) (although it's yet to be included in the
main branch). It's part of Biopython's SearchIO module that I recently
wrote (the name SearchIO might change in the future). And indeed, my
early benchmarks have shown that it is faster than the existing
NCBIXML parser.
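
To give a rough idea of why an ElementTree-based parser can cope with
very large files, here is a minimal sketch of streaming BLAST XML with
the standard library's `iterparse`. This is not the SearchIO code: the
function name `iter_blast_hits` and the sample data are made up for
illustration, and only the element names are taken from the NCBI BLAST
XML format. On the Python versions current at the time of writing you
would import `xml.etree.cElementTree` instead to get the C
implementation.

```python
import io
import xml.etree.ElementTree as ET  # on old Pythons: xml.etree.cElementTree

# Tiny hand-written sample (hypothetical data, not real BLAST output).
SAMPLE_XML = b"""<?xml version="1.0"?>
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>query1</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_id>subject_A</Hit_id>
          <Hit_hsps>
            <Hsp><Hsp_evalue>1e-20</Hsp_evalue></Hsp>
            <Hsp><Hsp_evalue>3e-05</Hsp_evalue></Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def iter_blast_hits(handle):
    """Yield (query_def, hit_id, best_evalue) for each hit in the stream."""
    # iterparse reads the file incrementally; an "end" event fires once an
    # element and all of its children have been parsed.
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag != "Iteration":
            continue  # wait until one whole query result is complete
        query_def = elem.findtext("Iteration_query-def")
        for hit in elem.iterfind("Iteration_hits/Hit"):
            evalues = [float(e.text)
                       for e in hit.iterfind("Hit_hsps/Hsp/Hsp_evalue")]
            best = min(evalues) if evalues else None
            yield query_def, hit.findtext("Hit_id"), best
        # Drop the children of the finished Iteration so that memory use
        # does not grow with the number of hits already processed.
        elem.clear()


if __name__ == "__main__":
    for query, hit_id, evalue in iter_blast_hits(io.BytesIO(SAMPLE_XML)):
        print(query, hit_id, evalue)
```

With a real file you would pass an open file handle (binary mode)
instead of the `io.BytesIO` wrapper.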

This branch is available here:
https://github.com/bow/biopython/tree/searchio. I've also written a
draft tutorial on how to use it here:
http://bow.web.id/biopython/Tutorial.html#htoc96.

However, as it's not yet in the main branch, you need to do a
little bit of command line work to set it up:

1. Set up a new virtualenv environment (so that it doesn't clash with
your other Biopython installation) and activate it.
2. Clone the repository with
`git clone https://github.com/bow/biopython.git`, then check out the
'searchio' branch.
3. Run `python setup.py develop` inside the clone. This will keep the
installation in sync with any future `git pull` you might perform on
the branch.

Hope this helps :),
Bow


On Thu, Sep 13, 2012 at 5:20 PM, Martin Mokrejs
<mmokrejs at fold.natur.cuni.cz> wrote:
> Hi,
>   I am using "blastall -p blastn ... -m 7" to yield about 100GB large XML files
> which are then parsed by
>
>     from Bio.Blast import NCBIXML
>     _blastn_fileh = open(blast_out_xml_filename)
>     _blastn_iterator = NCBIXML.parse(_blastn_fileh)
>     _record = next(_blastn_iterator) # fetch the very first BLAST result from generator
>
>   In my case the blastn searches seem to take longer than takes the XML parsing. :(
> I do not have timing numbers here but wonder why is cElementTree used only in Uniprot
> biopython modules and not in SeqIO. What XML parsing library is my biopython-1.59 using?
> Isn't there any argument when setup.py is called to discern between elementtree, cElementTree
> which I think use expat ...? I am writing this a bit from top of my head hoping Peter ;-)
> or somebody else will know right away where to look for a performance bottleneck
> and where to change code to use cElementTree which always seemed the fastest to me.
> Thank you for some initial advice.
> Martin
> P.S.: And yes, I would love to parse blastn plaintext output or some other more compact one,
> the XML is really an overkill.


