[Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?

Wed Sep 19 18:10:31 UTC 2012

On Sat, Sep 15, 2012 at 9:22 AM, Wibowo Arindrarto
<w.arindrarto at gmail.com> wrote:

> Hi guys,
>
> > > 2) If we add a function to Biopython that generates Blast plain-text
> > > output (or something close to it) from Blast XML output, then a user
> > > can generate the Blast output in XML format, parse it with Biopython,
> > > optionally filter it, and then generate the corresponding plain-text
> > > output;
> >
> > The new 'SearchIO' results objects str/repr should be familiar to
> > anyone who has looked at the plain text BLAST output - but
> > not identical. We could apply some of these improvements
> > to the current BLAST parsers, but I favour aiming to simply
> > deprecate them in favour of 'SearchIO' (namespace to be
> > decided).
> >
> > However, we certainly could try and offer a plain-text BLAST
> > output format from 'SearchIO', although IIRC Bow has not tried
> > that yet. It shouldn't be too complicated - unless you aim for
> > 100% agreement with the latest BLAST output (moving target).
>
> Yes, this has not been attempted ~ mostly because I feel that the
> BLAST plain text is indeed a moving target. But, if we are in favor of
> choosing one format from one BLAST version and always stick to it, it
> sounds more reasonable.
>

Since NCBI is not planning to make any more changes to "legacy" blastall,
this could be an opportunity to settle on one stable plain-text BLAST
output style to parse in Bio.Search(IO), and admit that we're not going to
bother keeping up with BLAST+ plain-text reports.

(I imagine there's a certain degree of overlap between users stuck with
legacy BLAST installations and those stuck with plain-text BLAST reports.)

>
> There is one missing detail that is only present in the plain text
> format, though: the hit-level e-values. If we do decide to write a
> plain text writer, we either have to demand the user supply these
> values, or we omit the entire hit-level e-value table, or we fill it
> with something else.
>

But the HSP-level scores or bit scores are included, right? The database
size, query length and Karlin-Altschul kappa and lambda values are included
in the BLAST XML output, so it's possible (and not difficult) to
recalculate the e-values.
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html#head3

Note that BLAST tweaks the raw alignment score with its own heuristics,
so it's not easy to recompute the raw score from the alignment in the XML.
But once you have the raw score, the rest is straightforward.
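
To make that last step concrete, here is a minimal sketch of the
Karlin-Altschul formulas from the tutorial linked above: E = K*N*exp(-lambda*S)
for a raw score S, and E = N*2^(-S') for a bit score S'. The function and
parameter names are my own for illustration, not a Biopython API, and
search_space is assumed to be the search-space size N (query length times
database length). BLAST itself works with an edge-corrected "effective"
search space, so plugging in the uncorrected lengths should only be expected
to give approximately the reported values.

```python
import math


def bit_score(raw_score, lam, kappa):
    """Convert a raw score S to a bit score S' = (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(kappa)) / math.log(2)


def evalue_from_raw(raw_score, lam, kappa, search_space):
    """E = K * N * exp(-lambda * S), with N the search-space size."""
    return kappa * search_space * math.exp(-lam * raw_score)


def evalue_from_bits(bits, search_space):
    """E = N * 2**(-S'), equivalent to the raw-score form above."""
    return search_space * 2.0 ** (-bits)


if __name__ == "__main__":
    # Illustrative numbers only (hypothetical, not taken from a real report).
    lam, kappa = 0.267, 0.041
    search_space = 1.0e9
    raw = 100
    bits = bit_score(raw, lam, kappa)
    print("bit score: %.1f" % bits)
    print("E from raw score: %.3g" % evalue_from_raw(raw, lam, kappa, search_space))
    print("E from bit score: %.3g" % evalue_from_bits(bits, search_space))
```

Both routes should give the same e-value, which is a handy sanity check on
whichever lambda and kappa values you pull out of the XML statistics block.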

Cheers,
Eric