[Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
Michiel de Hoon
mjldehoon at yahoo.com
Sun Sep 16 13:54:37 UTC 2012
Hi Bow,
Is there some documentation somewhere for the SearchIO module? I have a hard time understanding what it does and how it relates to Blast.
Thanks,
-Michiel.
--- On Sat, 9/15/12, Wibowo Arindrarto <w.arindrarto at gmail.com> wrote:
> From: Wibowo Arindrarto <w.arindrarto at gmail.com>
> Subject: Re: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
> To: "BioPython Mailing List" <biopython at lists.open-bio.org>
> Date: Saturday, September 15, 2012, 9:22 AM
> Hi guys,
>
> > > 2) If we add a function to Biopython that generates Blast plain-text
> > > output (or something close to it) from Blast XML output, then a user can
> > > generate the Blast output in XML format, parse it with Biopython,
> > > optionally filter it, and then generate the corresponding plain-text
> > > output;
> >
> > The new 'SearchIO' results objects str/repr should be familiar to
> > anyone who has looked at the plain text BLAST output - but
> > not identical. We could apply some of these improvements
> > to the current BLAST parsers, but I favour aiming to simply
> > deprecate them in favour of 'SearchIO' (namespace to be
> > decided).
> >
> > However, we certainly could try and offer a plain-text BLAST
> > output format from 'SearchIO', although IIRC Bow has not tried
> > that yet. It shouldn't be too complicated - unless you aim for
> > 100% agreement with the latest BLAST output (moving target).
>
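A minimal sketch of the "parse it, optionally filter it" step discussed above, using only the standard library's xml.etree.ElementTree rather than Biopython's own parsers. The sample XML is hand-written in the NCBI BLAST XML layout (it is not real search output), and iter_hits / filter_hits are hypothetical helper names, not Biopython functions.

```python
import io
import xml.etree.ElementTree as ET

# Hand-written sample in the NCBI BLAST XML layout; values are made up.
SAMPLE_XML = """<?xml version="1.0"?>
<BlastOutput>
  <BlastOutput_program>blastn</BlastOutput_program>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>query_1</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_id>hit_A</Hit_id>
          <Hit_def>first made-up subject</Hit_def>
          <Hit_hsps>
            <Hsp><Hsp_bit-score>180.5</Hsp_bit-score><Hsp_evalue>1e-40</Hsp_evalue></Hsp>
            <Hsp><Hsp_bit-score>60.2</Hsp_bit-score><Hsp_evalue>2e-05</Hsp_evalue></Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_id>hit_B</Hit_id>
          <Hit_def>second made-up subject</Hit_def>
          <Hit_hsps>
            <Hsp><Hsp_bit-score>30.1</Hsp_bit-score><Hsp_evalue>0.5</Hsp_evalue></Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def iter_hits(handle):
    """Yield (query, hit_id, hit_def, [(bit_score, evalue), ...]) for each <Hit>.

    Uses iterparse so the whole document is never held in memory at once.
    """
    query = None
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "Iteration_query-def":
            query = elem.text
        elif elem.tag == "Hit":
            hsps = [
                (float(h.findtext("Hsp_bit-score")), float(h.findtext("Hsp_evalue")))
                for h in elem.iter("Hsp")
            ]
            yield query, elem.findtext("Hit_id"), elem.findtext("Hit_def"), hsps
            elem.clear()  # drop the subtree that has already been consumed


def filter_hits(hits, max_evalue):
    """Keep only HSPs at or below max_evalue; drop hits left with no HSPs."""
    for query, hit_id, hit_def, hsps in hits:
        kept = [(bits, evalue) for bits, evalue in hsps if evalue <= max_evalue]
        if kept:
            yield query, hit_id, hit_def, kept


handle = io.BytesIO(SAMPLE_XML.encode("utf-8"))
kept_hits = list(filter_hits(iter_hits(handle), max_evalue=1e-3))
for query, hit_id, hit_def, hsps in kept_hits:
    print(query, hit_id, hit_def, len(hsps))
```

The filtered tuples could then be handed to whatever plain-text writer is eventually agreed on; the tuple layout here is only an illustration, not the SearchIO object model.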
> Yes, this has not been attempted, mostly because I feel that the
> BLAST plain text is indeed a moving target. But if we are in favor of
> choosing one format from one BLAST version and always sticking to it, it
> sounds more reasonable.
>
> There is one missing detail that is only present in the plain text
> format, though: the hit-level e-values. If we do decide to write a
> plain text writer, we either have to demand that the user supply these
> values, or omit the entire hit-level e-value table, or fill it
> with something else.
>
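One way the "fill it with something else" option might look, as a sketch only: take each hit's highest-scoring HSP and use its bit score and e-value as a stand-in for the hit-level values. This is an assumption for illustration, not a claim about how BLAST itself fills that table. Hsp, Hit and summary_table are hypothetical names, the column layout is only loosely modelled on the plain-text output, and every hit is assumed to have at least one HSP.

```python
from typing import List, NamedTuple


class Hsp(NamedTuple):
    bit_score: float
    evalue: float


class Hit(NamedTuple):
    hit_id: str
    description: str
    hsps: List[Hsp]  # assumed non-empty


def summary_table(hits):
    """Return the lines of a one-line-per-hit table, sorted by e-value.

    The hit-level score and e-value are NOT taken from BLAST; they are
    borrowed from the hit's highest-scoring HSP as a placeholder.
    """
    rows = []
    for hit in hits:
        best = max(hit.hsps, key=lambda h: h.bit_score)
        label = "{} {}".format(hit.hit_id, hit.description)[:40]
        rows.append((best, label))
    rows.sort(key=lambda row: row[0].evalue)

    lines = ["{:<40} {:>8} {:>9}".format("Description", "Bits", "E-value")]
    for best, label in rows:
        lines.append(
            "{:<40} {:>8.1f} {:>9.0e}".format(label, best.bit_score, best.evalue)
        )
    return lines


# Made-up values, for illustration only.
SAMPLE_HITS = [
    Hit("hit_B", "second made-up subject", [Hsp(30.1, 0.5)]),
    Hit("hit_A", "first made-up subject", [Hsp(60.2, 2e-05), Hsp(180.5, 1e-40)]),
]

table = summary_table(SAMPLE_HITS)
print("\n".join(table))
```

If the placeholder values are not acceptable, the same function could simply skip the two numeric columns, which corresponds to the "omit the table" option.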
> > Another idea we touched on was deprecating the current old,
> > complex but flexible plain text parser while adding a new simpler
> > plain text parser as part of 'SearchIO'. Here we could target only
> > the recent BLAST+ output (and perhaps if not so different the
> > final 'legacy' BLAST release), and not worry about all the variants
> > the NCBI have produced over the years. I would hope this would
> > also be faster [especially as currently 'SearchIO' supports parsing
> > plain text BLAST on top of the existing old parser].
>
> This wasn't attempted either, mostly because I feel that a lot of
> people still use legacy BLAST (we've had more legacy-BLAST related
> emails than BLAST+ ones in the past few months, I think). Also,
> the current parser wins on flexibility. I think the test cases include
> BLAST versions from 2002 (10 years ago!) up to BLAST 2.2.25+. So, as
> Peter mentioned, the current SearchIO BLAST plain text parser is
> actually a simple wrapper over Bio.Blast.NCBIStandalone.
>
> We might be able to create a newer, speedier parser, but
> making it as
> flexible as our current one seems difficult.
>
> regards,
> Bow