[Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?

Wibowo Arindrarto w.arindrarto at gmail.com
Sat Sep 15 13:22:48 UTC 2012


Hi guys,

> > 2) If we add a function to Biopython that generates Blast plain-text
> > output (or something close to it) from Blast XML output, then a user can
> > generate the Blast output in XML format, parse it with Biopython,
> > optionally filter it, and then generate the corresponding plain-text
> > output;
>
> The new 'SearchIO' results objects str/repr should be familiar to
> anyone who has looked at the plain text BLAST output - but
> not identical. We could apply some of these improvements
> to the current BLAST parsers, but I favour aiming to simply
> deprecate them in favour of 'SearchIO' (namespace to be
> decided).
>
> However, we certainly could try and offer a plain-text BLAST
> output format from 'SearchIO', although IIRC Bow has not tried
> that yet. It shouldn't be too complicated - unless you aim for
> 100% agreement with the latest BLAST output (moving target).

Yes, this has not been attempted, mostly because I feel that the
BLAST plain text is indeed a moving target. But if we are in favor of
choosing one format from one BLAST version and always sticking to it,
it sounds more reasonable.

There is one detail that is only present in the plain text format,
though, and missing from the XML: the hit-level e-values. If we do
decide to write a plain text writer, we have to either demand that the
user supply these values, omit the entire hit-level e-value table, or
fill it with something else.
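
To make the three options concrete, here is a minimal sketch of a
summary-table writer. Nothing in it is existing Biopython code: the Hit
and Hsp classes, the summary_table function and its "missing" argument
are all hypothetical names, and using the lowest HSP e-value as the
"something else" is only one possible assumption. It also assumes every
hit has at least one HSP.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hsp:
    bitscore: float
    evalue: float


@dataclass
class Hit:
    description: str
    hsps: List[Hsp]
    # Hit-level e-value; None when the data came from a source lacking it.
    evalue: Optional[float] = None


def summary_table(hits, missing="omit"):
    """Render one summary row per hit (hypothetical layout).

    missing="require"  -> raise if a hit has no hit-level e-value
    missing="omit"     -> leave the e-value column out entirely
    missing="best_hsp" -> fall back to the lowest HSP e-value (an assumption)
    """
    if missing not in ("require", "omit", "best_hsp"):
        raise ValueError("unknown policy: %r" % (missing,))
    lines = []
    for hit in hits:
        # Report the bit score of the highest-scoring HSP for the hit.
        best = max(hit.hsps, key=lambda h: h.bitscore)
        row = f"{hit.description[:40]:<40}  {best.bitscore:>7.1f}"
        if missing == "require":
            if hit.evalue is None:
                raise ValueError(f"no hit-level e-value for {hit.description!r}")
            row += f"  {hit.evalue:>8.0e}"
        elif missing == "best_hsp":
            if hit.evalue is not None:
                evalue = hit.evalue
            else:
                evalue = min(h.evalue for h in hit.hsps)
            row += f"  {evalue:>8.0e}"
        lines.append(row)
    return "\n".join(lines)
```

Making the policy an explicit argument keeps the decision with the
user, so the writer never silently passes off a substituted value as a
real hit-level e-value.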

> Another idea we touched on was deprecating the current old,
> complex but flexible plain text parser while adding a new simpler
> plain text parser as part of 'SearchIO'. Here we could target only
> the recent BLAST+ output (and perhaps if not so different the
> final 'legacy' BLAST release), and not worry about all the variants
> the NCBI have produced over the years. I would hope this would
> also be faster [especially as currently 'SearchIO' supports parsing
> plain text BLAST on top of the existing old parser].

This hasn't been attempted either, mostly because I feel that a lot of
people still use legacy BLAST (we've had more legacy-BLAST related
emails than BLAST+ ones in the past few months, I think). Also, the
current parser wins on flexibility. I think the test cases include
BLAST versions from 2002 (10 years ago!) up to BLAST 2.2.25+. So, as
Peter mentioned, the current SearchIO BLAST plain text parser is
actually a simple wrapper over Bio.Blast.NCBIStandalone.
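
For anyone unfamiliar with the arrangement, a wrapper of this kind
might look roughly like the sketch below. It is purely illustrative:
legacy_parse, its toy input format and all of the record classes are
stand-ins invented for this example, not the actual
Bio.Blast.NCBIStandalone or SearchIO code.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


# Stand-ins for the objects an older parser might return (hypothetical).
@dataclass
class LegacyAlignment:
    title: str
    length: int


@dataclass
class LegacyRecord:
    query: str
    alignments: List[LegacyAlignment] = field(default_factory=list)


def legacy_parse(handle) -> Iterator[LegacyRecord]:
    """Toy 'old' parser: lines are 'Q <query>' or 'H <length> <title>'."""
    record = None
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("Q "):
            if record is not None:
                yield record
            record = LegacyRecord(query=line[2:])
        elif line.startswith("H ") and record is not None:
            _, length, title = line.split(" ", 2)
            record.alignments.append(LegacyAlignment(title, int(length)))
    if record is not None:
        yield record


# New-style objects exposed by the wrapper (also hypothetical).
@dataclass
class NewHit:
    id: str
    seq_len: int


@dataclass
class NewQueryResult:
    id: str
    hits: List[NewHit]


def wrapped_parse(handle) -> Iterator[NewQueryResult]:
    """All real parsing is delegated; this only converts object models."""
    for rec in legacy_parse(handle):
        hits = [
            NewHit(id=aln.title.split()[0], seq_len=aln.length)
            for aln in rec.alignments
        ]
        yield NewQueryResult(id=rec.query, hits=hits)
```

Since every record still passes through the old parser before being
converted, a wrapper like this inherits its flexibility but cannot be
faster than it.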

We might be able to create a newer, speedier parser, but making it as
flexible as our current one seems difficult.

regards,
Bow


