[Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
Peter Cock
p.j.a.cock at googlemail.com
Fri Sep 14 08:31:31 UTC 2012
On Fri, Sep 14, 2012 at 9:12 AM, Martin Mokrejs
<mmokrejs at fold.natur.cuni.cz> wrote:
> Hi all,
> as a long-term subscriber to this list, and to bioperl in the past, I do know
> that the plaintext output changes silently and that this is a hassle for
> maintainers. On the other hand, the XML tags and syntax are far too verbose.
> That in turn means lots of disk and memory I/O, long parsing times and, of course, large files.
> It would help at least if the XML tags were abbreviated to shorter strings. ;-)
> Umm, I also hit a bug in legacy blastn XML output, still no answer from NCBI:
> https://redmine.open-bio.org/issues/3354
Earlier this week the NCBI released BLAST 2.2.27+ which might
fix this...
> A real case: an SFF file is 288 MB in size. The extracted FASTA file, with 180271
> sequences, takes 60 MB. Low-complexity masking takes 6 minutes. A legacy blastn search
> using 59 queries against that dataset takes 17 minutes and yields 3957 MB of XML.
> Parsing the XML file with Biopython takes 56 minutes to convert the
> results into my own CSV file (some of the overhead could be my program, sure). Doing
> a full Smith-Waterman search using 8 queries takes just 126 minutes. The times
> are taken from file timestamps, so they are wall-clock times. I will try to find some
> time in a week or so to run profiling with runsnake
> (http://www.vrplumber.com/programming/runsnakerun/),
> and to test the new parser from Wibowo and report back. ;-)
Great :)
> By plaintext I actually meant some tabular output format, which would
> be enough for my purposes (match and query coordinates, scores, gaps, identities).
>
I find the BLAST+ tabular output very useful - you can control which
columns you get if the default 12 are not enough - and trivial to parse.
This is also supported in Bow's SearchIO branch.
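
To give an idea of how little code the tabular format needs, here is a
rough sketch using only the Python standard library. It assumes the file
was written with the default 12 columns of BLAST+ tabular output
(-outfmt 6, or -outfmt 7 with its "#" comment lines). The names
parse_tabular and Hit are made up for this example and are not part of
Biopython; if you ask BLAST+ for a custom column set, FIELDS would need
to be changed to match.

```python
import csv
from collections import namedtuple

# Default 12 columns of BLAST+ tabular output (-outfmt 6 / -outfmt 7).
FIELDS = ("qseqid sseqid pident length mismatch gapopen "
          "qstart qend sstart send evalue bitscore").split()

# Hypothetical record type for this sketch (not a Biopython class).
Hit = namedtuple("Hit", FIELDS)

INT_FIELDS = {"length", "mismatch", "gapopen",
              "qstart", "qend", "sstart", "send"}
FLOAT_FIELDS = {"pident", "evalue", "bitscore"}


def parse_tabular(handle):
    """Yield one Hit per data line, skipping blank and '#' comment lines."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue
        if len(row) != len(FIELDS):
            raise ValueError("expected %d columns, got %d"
                             % (len(FIELDS), len(row)))
        values = []
        for name, text in zip(FIELDS, row):
            if name in INT_FIELDS:
                values.append(int(text))
            elif name in FLOAT_FIELDS:
                values.append(float(text))
            else:
                values.append(text)
        yield Hit(*values)
```

Because this is a generator reading line by line, memory use stays flat
no matter how large the file is, e.g.:

    with open("results.tsv") as handle:   # file name is just an example
        for hit in parse_tabular(handle):
            print(hit.qseqid, hit.sseqid, hit.qstart, hit.qend, hit.bitscore)
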
Peter