[Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
Martin Mokrejs
mmokrejs at fold.natur.cuni.cz
Thu Sep 13 11:20:13 EDT 2012
Hi,
I am using "blastall -p blastn ... -m 7" to produce XML files of about 100 GB each,
which are then parsed with:
from Bio.Blast import NCBIXML
_blastn_fileh = open(blast_out_xml_filename)
_blastn_iterator = NCBIXML.parse(_blastn_fileh)
_record = next(_blastn_iterator)  # fetch the very first BLAST result from the generator
In my case the blastn searches seem to take longer than the XML parsing. :(
I do not have timing numbers here, but I wonder why cElementTree is used only in the
UniProt modules of Biopython and not in SeqIO. Which XML parsing library is my biopython-1.59 using?
Is there any argument to setup.py that selects between ElementTree and cElementTree,
which I think both use expat ...? I am writing this off the top of my head, hoping Peter ;-)
or somebody else will know right away where to look for a performance bottleneck
and where to change the code to use cElementTree, which has always seemed the fastest to me.
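To get some timing numbers for comparison, something like the following rough sketch could be used. It is not Biopython's parser and not a drop-in replacement for NCBIXML.parse; it only streams a BLAST-style XML document with the standard library's ElementTree (which picks up the C accelerator automatically on Python 3.3 and later, so no separate cElementTree import is needed there). The helper names iter_queries and time_parse and the SAMPLE_XML document are made up for illustration; only the element names (Iteration, Iteration_query-def, Hit, Hit_def) come from the BLAST XML format.

```python
import io
import time
import xml.etree.ElementTree as ET  # C-accelerated by default on Python 3.3+

# Made-up stand-in for a BLAST XML file; a real file has many more elements.
SAMPLE_XML = b"""<?xml version="1.0"?>
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>query_1</Iteration_query-def>
      <Iteration_hits>
        <Hit><Hit_def>subject_A</Hit_def></Hit>
        <Hit><Hit_def>subject_B</Hit_def></Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_query-def>query_2</Iteration_query-def>
      <Iteration_hits></Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def iter_queries(handle):
    """Yield (query definition, list of hit definitions) per <Iteration>."""
    # iterparse reads the file incrementally instead of loading it whole.
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "Iteration":
            query = elem.findtext("Iteration_query-def")
            hits = [hit.findtext("Hit_def") for hit in elem.iter("Hit")]
            yield query, hits
            # Drop the finished subtree so memory use does not grow
            # with every hit that has already been processed.
            elem.clear()


def time_parse(handle):
    """Return (records, elapsed seconds) for one full pass over handle."""
    start = time.perf_counter()
    records = list(iter_queries(handle))
    return records, time.perf_counter() - start


if __name__ == "__main__":
    records, seconds = time_parse(io.BytesIO(SAMPLE_XML))
    print(len(records), "iterations parsed in", round(seconds, 6), "s")
```

Running the same loop over a slice of a real output file, next to a loop over NCBIXML.parse on the same slice, should show whether the XML parsing itself or the construction of the record objects dominates the run time.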
Thank you for some initial advice.
Martin
P.S.: And yes, I would love to parse the blastn plain-text output or some other more compact format;
the XML is really overkill.