[Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?

Fri Sep 14 05:47:48 EDT 2012

Hi Michiel,

Michiel de Hoon wrote:
> Hi Martin,
> 
> --- On Fri, 9/14/12, Martin Mokrejs <mmokrejs at fold.natur.cuni.cz> wrote:
>> Legacy blastn search using 59 queries through dataset
>> that takes 17 minutes and yields XML with 3957MB
>> in size. Parsing the XML file through biopython takes 56
>> minutes to convert the results into my own CSV file
> 
> How does this compare to parsing human-readable plain text output? Is
> it significantly faster than the XML parser?

I don't have numbers, but as a reference point the mdust program (compiled
from C) parsed the FASTA file in 6 minutes, so I would be happy if parsing
a CSV file with about 1/5 as many lines as the FASTA file took roughly the
same time. Biopython uses generators, and so does my program, so the main
overhead in my program is string slicing and string-to-int/float/list
conversion.
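
To make the generator-plus-conversion pattern concrete, here is a minimal sketch that streams HSPs out of a BLAST XML file with the standard library's xml.etree.ElementTree.iterparse and converts the text fields to int/float on the way. It is not Biopython's implementation and not my actual program; the function name iter_hsps, the choice of fields, and the tiny sample document are assumptions made for illustration only.

```python
import io
import xml.etree.ElementTree as ET


def iter_hsps(xml_source):
    """Yield one dict per HSP from a BLAST XML file name or file-like object."""
    query = hit_id = None
    # Only "end" events are requested, so every element is complete when seen.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        tag = elem.tag
        if tag == "Iteration_query-def":
            query = elem.text
        elif tag == "Hit_id":
            hit_id = elem.text
        elif tag == "Hsp":
            # String-to-number conversion happens here, once per HSP.
            yield {
                "query": query,
                "hit": hit_id,
                "bit_score": float(elem.findtext("Hsp_bit-score")),
                "evalue": float(elem.findtext("Hsp_evalue")),
                "query_from": int(elem.findtext("Hsp_query-from")),
                "query_to": int(elem.findtext("Hsp_query-to")),
                "hit_from": int(elem.findtext("Hsp_hit-from")),
                "hit_to": int(elem.findtext("Hsp_hit-to")),
                "identity": int(elem.findtext("Hsp_identity")),
                # Assumption: treat a missing gap count as zero.
                "gaps": int(elem.findtext("Hsp_gaps", default="0")),
            }
        elif tag in ("Hit", "Iteration"):
            # Drop the finished subtree so memory does not grow with file size.
            elem.clear()


# Hypothetical, heavily trimmed sample document for demonstration.
SAMPLE_XML = b"""<BlastOutput>
 <BlastOutput_iterations>
  <Iteration>
   <Iteration_query-def>query1</Iteration_query-def>
   <Iteration_hits>
    <Hit>
     <Hit_id>subject1</Hit_id>
     <Hit_hsps>
      <Hsp>
       <Hsp_bit-score>50.1</Hsp_bit-score>
       <Hsp_evalue>1e-05</Hsp_evalue>
       <Hsp_query-from>1</Hsp_query-from>
       <Hsp_query-to>30</Hsp_query-to>
       <Hsp_hit-from>101</Hsp_hit-from>
       <Hsp_hit-to>130</Hsp_hit-to>
       <Hsp_identity>29</Hsp_identity>
       <Hsp_gaps>0</Hsp_gaps>
      </Hsp>
     </Hit_hsps>
    </Hit>
   </Iteration_hits>
  </Iteration>
 </BlastOutput_iterations>
</BlastOutput>"""

for hsp in iter_hsps(io.BytesIO(SAMPLE_XML)):
    print(hsp["query"], hsp["hit"], hsp["bit_score"], hsp["evalue"])
```

Whether this is faster than any other parser on a multi-gigabyte file would have to be measured; the sketch only shows where the per-record conversion cost sits.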

> 
>> By plaintext I actually meant a tabular
>> output format, which would be enough for my purposes
>> (match and query coordinates, scores, gaps, identities).
> 
> Maintaining the tabular Blast output parser has not been a problem,
> and I expect that it will continue to be supported in Biopython. On
> the other hand, maintaining the human-readable plain text parser has
> been a recurring headache. If Biopython can parse tabular Blast
> output, then do you still need the human-readable plain text parser?
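
For illustration, reading tabular output needs little more than a split and a few type conversions. The sketch below uses only the standard library and is not the Biopython tabular parser. It assumes the usual 12-column layout (query, subject, % identity, alignment length, mismatches, gap openings, query start/end, subject start/end, e-value, bit score); the names TabularHit and iter_tabular_hits are made up for this example.

```python
from collections import namedtuple

# Assumed column order of the standard 12-column tabular BLAST output.
TabularHit = namedtuple(
    "TabularHit",
    "query subject pct_identity align_len mismatches gap_opens "
    "q_start q_end s_start s_end evalue bit_score",
)

# One converter per column, in the same order as the fields above.
_CONVERTERS = (str, str, float, int, int, int, int, int, int, int, float, float)


def iter_tabular_hits(lines):
    """Yield a TabularHit for every data line; skip blank and '#' comment lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != len(_CONVERTERS):
            raise ValueError("expected 12 tab-separated columns: %r" % line)
        yield TabularHit(*(conv(f) for conv, f in zip(_CONVERTERS, fields)))


# Hypothetical sample record; real input would be an open file object.
sample = [
    "# comment line\n",
    "query1\tsubject1\t98.50\t200\t3\t0\t1\t200\t1001\t1200\t1e-100\t360\n",
]
for hit in iter_tabular_hits(sample):
    print(hit.query, hit.subject, hit.q_start, hit.q_end, hit.bit_score)
```

Because the function is a generator, passing it an open file handle processes one line at a time instead of loading the whole result set.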

Sometimes I have parsed the alignment to get the number of matches and
mismatches (the pipes, minuses, dots), but I don't need that at this very
moment. Their distribution along the alignment is important and sometimes
helpful. BTW, I hate that blastn changes the letter case of the sequence in
its output. ;-)
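
As a rough sketch of what I mean by the distribution along the alignment: given the match line of an alignment as a plain string, one can tally the symbols and count matches per window. This assumes that "|" marks an identical position, as in the blastn pairwise match line; the meaning of other symbols depends on the output format. The function names and the sample string are made up for illustration.

```python
from collections import Counter


def symbol_counts(midline):
    """Count how often each symbol occurs in an alignment match line."""
    return Counter(midline)


def windowed_matches(midline, window=10, match_char="|"):
    """Return the number of match symbols in each consecutive window."""
    return [
        midline[start:start + window].count(match_char)
        for start in range(0, len(midline), window)
    ]


# Hypothetical match line: 16 columns, three of them without a match symbol.
midline = "|||| |||||" + "||  ||"
print(symbol_counts(midline))
print(windowed_matches(midline, window=10))
```

A run of low counts in neighbouring windows points to a locally poor stretch that the overall identity figure alone would hide.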

Martin