[Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?

Fri Sep 14 08:12:10 UTC 2012

Hi all,
  as a long-term subscriber to this list, and to the bioperl list in the past, I do know
that the plaintext output gets changed silently and that this is a hassle for
maintainers. On the other hand, the XML tags and syntax are way too verbose.
That in turn means lots of disk and memory I/O, long parsing times and of course large files.
If only the XML tags were at least abbreviated to shorter strings. ;-)
Umm, I also hit a bug in legacy blastn XML output, still no answer from NCBI:
https://redmine.open-bio.org/issues/3354

  A real case. An SFF file is 288MB in size. The FASTA file extracted from it, with 180271
sequences, takes 60MB. Low-complexity masking takes 6 minutes. A legacy blastn search
of 59 queries against that dataset takes 17 minutes and yields 3957MB of XML.
Parsing the XML file with biopython and converting the results into my own CSV
file takes 56 minutes (some of the overhead could be my program, sure). Doing
a full Smith-Waterman search using 8 queries takes just 126 minutes. The times
are from file timestamps, so they are wall-clock times. I will try to find some time in
a week or so to run profiling using runsnake (http://www.vrplumber.com/programming/runsnakerun/),
test the new parser from Wibowo, and report back. ;-)
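
  For illustration only, here is a minimal sketch of a streaming XML-to-CSV
conversion. It is not the program timed above and it does not use the Biopython
parser: it swaps in the standard library's xml.etree.ElementTree.iterparse so
that each <Hit> can be discarded as soon as it has been written out. The function
name blast_xml_to_csv and the choice of CSV columns are assumptions made for this
example; the tag names are the ones found in NCBI BLAST XML output.

```python
import csv
import xml.etree.ElementTree as ET

# HSP child tags copied into the CSV. The selection is an assumption for
# this sketch (coordinates, scores, identities, gaps).
HSP_FIELDS = [
    "Hsp_query-from", "Hsp_query-to", "Hsp_hit-from", "Hsp_hit-to",
    "Hsp_bit-score", "Hsp_evalue", "Hsp_identity", "Hsp_gaps",
    "Hsp_align-len",
]


def blast_xml_to_csv(xml_source, csv_handle):
    """Stream HSPs from a BLAST XML file (path or binary handle) into CSV.

    Returns the number of HSP rows written.
    """
    writer = csv.writer(csv_handle)
    writer.writerow(["query", "hit_id"] + HSP_FIELDS)
    rows = 0
    query = ""
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "Iteration_query-def":
            # Closes before the hits of the same <Iteration> are read.
            query = elem.text or ""
        elif elem.tag == "Hit":
            hit_id = elem.findtext("Hit_id", default="")
            for hsp in elem.iter("Hsp"):
                values = [hsp.findtext(tag, default="") for tag in HSP_FIELDS]
                writer.writerow([query, hit_id] + values)
                rows += 1
            # Drop the children of this hit; the emptied element stays
            # attached to its parent, which costs very little memory.
            elem.clear()
        elif elem.tag == "Iteration":
            elem.clear()
    return rows
```

Clearing at the <Hit> level rather than per <Iteration> matters when a few
queries produce a very large file, since a single iteration can then hold a
large share of the hits. Whether this is actually faster than the Biopython
parser would have to be measured.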

  By plaintext I actually meant some tabular output format, which would
be enough for my purposes (match and query coordinates, scores, gaps, identities).
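
  As a sketch of what reading such a format could look like: the 12-column
tab-separated output (legacy blastall -m 8, or -outfmt 6 in BLAST+) can be
read with the standard csv module. The function name read_blast_tabular is made
up for this example, and the short column names follow the BLAST+ convention.
Note that these default columns report percent identity and the number of gap
openings rather than identity and gap counts, so it is an assumption that they
cover every field listed above.

```python
import csv

# Default column order of the 12-column tabular output.
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_COLUMNS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_COLUMNS = ("pident", "evalue", "bitscore")


def read_blast_tabular(handle):
    """Yield one dict per hit line, skipping blank lines and '#' comments."""
    lines = (line for line in handle
             if line.strip() and not line.startswith("#"))
    for row in csv.DictReader(lines, fieldnames=COLUMNS, delimiter="\t"):
        # Convert numeric columns from text so they can be compared and sorted.
        for key in INT_COLUMNS:
            row[key] = int(row[key])
        for key in FLOAT_COLUMNS:
            row[key] = float(row[key])
        yield row
```

Because it is a generator working line by line, memory use stays flat no matter
how large the result file is.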

Martin

Fields, Christopher J wrote:
> On Sep 13, 2012, at 7:37 PM, Michiel de Hoon <mjldehoon at yahoo.com>
>  wrote:
> 
>> --- On Thu, 9/13/12, Martin Mokrejs <mmokrejs at fold.natur.cuni.cz> wrote:
>>> P.S.: And yes, I would love to parse blastn plaintext output
>>> or some other more compact one, the XML is really an overkill.
>>
>> What exactly is the advantage of plain text parsing compared to XML? File size?
>>
>> Best,
>> -Michiel.
> 
> There isn't any.  In fact, NCBI has consistently stated that one should never rely on parsing BLAST text output, primarily b/c they reserve the right to make changes to the output at any given point, whereas XML output should remain stable.  As someone who has taken care of legacy BLAST code for a number of years (BioPerl), I can state that is fairly close to the truth (the caveat being they have made changes that break some XML parsing, but they do try to fix them).  BLAST XML has simply been much easier to deal with in terms of fixing issues than text.
> 
> chris
> 
>