[Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?

Martin Mokrejs mmokrejs at fold.natur.cuni.cz
Thu Sep 13 15:20:13 UTC 2012


Hi,
  I am using "blastall -p blastn ... -m 7" to produce XML files of about 100 GB,
which are then parsed with:

    from Bio.Blast import NCBIXML
    _blastn_fileh = open(blast_out_xml_filename)
    _blastn_iterator = NCBIXML.parse(_blastn_fileh)
    _record = next(_blastn_iterator)  # fetch the very first BLAST result from the generator

  In my case the blastn searches seem to take longer than the XML parsing does. :(
I do not have timing numbers at hand, but I wonder why cElementTree is used only in the UniProt
Biopython modules and not in SeqIO. Which XML parsing library is my biopython-1.59 using?
Isn't there an argument to setup.py that selects between ElementTree and cElementTree,
which I think both use expat? I am writing this off the top of my head, hoping that Peter ;-)
or somebody else will know right away where to look for the performance bottleneck
and where to change the code to use cElementTree, which has always seemed the fastest to me.
Thank you for any initial advice.
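
For comparison, here is a minimal sketch of the kind of streaming parse I have in mind, using only the standard library's ElementTree iterparse rather than Bio.Blast.NCBIXML. It assumes the legacy BLAST XML element names (Iteration, Iteration_query-def, Hit, Hit_id). The function iter_blast_hits and the sample data are made up for illustration, and the sketch makes no claim about what Biopython does internally.

```python
import io

try:
    import xml.etree.cElementTree as ET  # C implementation on Python 2
except ImportError:
    import xml.etree.ElementTree as ET   # Python 3 uses its C accelerator automatically


def iter_blast_hits(handle):
    """Yield (query_def, [hit_id, ...]) for each <Iteration> in a BLAST XML stream.

    Hypothetical helper: it only extracts hit identifiers, nothing else.
    """
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag != "Iteration":
            continue
        query_def = elem.findtext("Iteration_query-def")
        hit_ids = [hit.findtext("Hit_id") for hit in elem.iter("Hit")]
        yield query_def, hit_ids
        # Free the children of the finished record; the emptied element
        # itself stays attached to its parent.
        elem.clear()


# Tiny made-up document that mimics the layout of "-m 7" output.
SAMPLE_XML = b"""<?xml version="1.0"?>
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>query1</Iteration_query-def>
      <Iteration_hits>
        <Hit><Hit_id>subject_A</Hit_id></Hit>
        <Hit><Hit_id>subject_B</Hit_id></Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_query-def>query2</Iteration_query-def>
      <Iteration_hits></Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""

for query, hits in iter_blast_hits(io.BytesIO(SAMPLE_XML)):
    print("%s: %s" % (query, ", ".join(hits)))
```

Timing such a loop against NCBIXML.parse on the same file would show whether the parser itself is the bottleneck or the construction of the record objects.
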
Martin
P.S.: And yes, I would love to parse blastn plaintext output or some other, more compact format;
the XML is really overkill.
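
One compact candidate is the tab-separated output that legacy blastall writes with "-m 8" (twelve columns per hit, no header line; "-m 9" adds comment lines starting with "#"). A rough sketch of reading it with plain string splitting follows. The field labels, the function iter_tabular_hits and the sample line are my own inventions for illustration, and all values are left as strings.

```python
import io

# Labels are my own; the file itself carries no header line.
FIELDS = ("query", "subject", "pct_identity", "aln_length", "mismatches",
          "gap_opens", "q_start", "q_end", "s_start", "s_end",
          "evalue", "bit_score")


def iter_tabular_hits(handle):
    """Yield one dict per hit line, skipping blank lines and '#' comments."""
    for line in handle:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        yield dict(zip(FIELDS, line.split("\t")))


# Made-up sample: one comment line and one hit.
SAMPLE_TAB = (
    "# BLASTN comment line\n"
    "query1\tsubject_A\t98.50\t200\t3\t0\t1\t200\t51\t250\t1e-100\t365\n"
)

for hit in iter_tabular_hits(io.StringIO(SAMPLE_TAB)):
    print("%s -> %s (evalue %s)" % (hit["query"], hit["subject"], hit["evalue"]))
```

This format drops the alignments themselves, so it only helps if the hit coordinates and scores are enough.
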


