[Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?

Wibowo Arindrarto w.arindrarto at gmail.com
Sun Sep 16 14:21:52 UTC 2012


Hi Michiel,

We have a draft tutorial that I'm temporarily hosting here:
http://bow.web.id/biopython/Tutorial.html#htoc96. The internal
functions have also been documented with docstrings and quick examples
(e.g. https://github.com/bow/biopython/blob/searchio/Bio/SearchIO/__init__.py).

At the moment, the SearchIO API is very similar to SeqIO and AlignIO,
though this is still subject to change.
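
As a rough sketch of what "similar to SeqIO" could look like in
practice: the function name summarize_search is made up for this
example, and the format name "blast-xml" and the attributes used
(qresult.id, hsp.evalue) are those of the later released Bio.SearchIO,
so the draft branch may differ.

```python
def summarize_search(path, fmt="blast-xml"):
    """Return (query id, hit count, best HSP e-value) for each query in a file.

    Hypothetical helper, not part of Biopython.
    """
    # Imported lazily so the sketch can be loaded without Biopython installed.
    from Bio import SearchIO

    rows = []
    # Same calling pattern as SeqIO.parse(handle, format): one object per
    # iteration, here a QueryResult instead of a SeqRecord.
    for qresult in SearchIO.parse(path, fmt):
        # A QueryResult iterates over its hits, and a hit over its HSPs.
        evalues = [hsp.evalue for hit in qresult for hsp in hit]
        rows.append((qresult.id, len(qresult), min(evalues, default=None)))
    return rows
```

Calling summarize_search("my_blast.xml") (a placeholder file name)
would then give one tuple per query.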

Hope this helps :). Otherwise, let me know which part specifically is
unclear to you.

regards,
Bow

On Sun, Sep 16, 2012 at 3:54 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> Hi Bow,
>
> Is there some documentation somewhere for the SearchIO module? I have a hard time understanding what it does and how it relates to Blast.
>
> Thanks,
> -Michiel.
>
> --- On Sat, 9/15/12, Wibowo Arindrarto <w.arindrarto at gmail.com> wrote:
>
>> From: Wibowo Arindrarto <w.arindrarto at gmail.com>
>> Subject: Re: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
>> To: "BioPython Mailing List" <biopython at lists.open-bio.org>
>> Date: Saturday, September 15, 2012, 9:22 AM
>> Hi guys,
>>
>> > > 2) If we add a function to Biopython that generates Blast plain-text
>> > > output (or something close to it) from Blast XML output, then a user can
>> > > generate the Blast output in XML format, parse it with Biopython,
>> > > optionally filter it, and then generate the corresponding plain-text
>> > > output;
>> >
>> > The new 'SearchIO' results objects str/repr should be familiar to
>> > anyone who has looked at the plain text BLAST output - but
>> > not identical. We could apply some of these improvements
>> > to the current BLAST parsers, but I favour aiming to simply
>> > deprecate them in favour of 'SearchIO' (namespace to be
>> > decided).
>> >
>> > However, we certainly could try and offer a plain-text BLAST
>> > output format from 'SearchIO', although IIRC Bow has not tried
>> > that yet. It shouldn't be too complicated - unless you aim for
>> > 100% agreement with the latest BLAST output (moving target).
>>
>> Yes, this has not been attempted, mostly because I feel that the
>> BLAST plain text is indeed a moving target. But if we are in favor of
>> choosing one format from one BLAST version and always sticking to it,
>> it sounds more reasonable.
>>
>> There is one missing detail that is only present in the plain text
>> format, though: the hit-level e-values. If we do decide to write a
>> plain text writer, we either have to demand that the user supply these
>> values, or we omit the entire hit-level e-value table, or we fill it
>> with something else.
>>
>> > Another idea we touched on was deprecating the current old,
>> > complex but flexible plain text parser while adding a new simpler
>> > plain text parser as part of 'SearchIO'. Here we could target only
>> > the recent BLAST+ output (and perhaps if not so different the
>> > final 'legacy' BLAST release), and not worry about all the variants
>> > the NCBI have produced over the years. I would hope this would
>> > also be faster [especially as currently 'SearchIO' supports parsing
>> > plain text BLAST on top of the existing old parser].
>>
>> This wasn't attempted either, mostly because I feel that a lot of
>> people still use legacy BLAST (we've had more legacy-BLAST related
>> emails than BLAST+ ones in the past few months, I think). Also,
>> the current parser wins on flexibility. I think the test cases include
>> BLAST versions from 2002 (10 years ago!) up to BLAST 2.2.25+. So, as
>> Peter mentioned, the current SearchIO BLAST plain text parser is
>> actually a simple wrapper over Bio.Blast.NCBIStandalone.
>>
>> We might be able to create a newer, speedier parser, but making it as
>> flexible as our current one seems difficult.
>>
>> regards,
>> Bow
>>
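
The three options mentioned in the quoted message for the missing
hit-level e-values (demand them from the user, omit the table, or fill
it with something else) could be sketched roughly as below. The Hit and
Hsp classes and the hit_table function are hypothetical stand-ins, not
SearchIO objects, and filling in the smallest HSP e-value of each hit is
only one assumed choice for "something else".

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Hsp:
    evalue: float


@dataclass
class Hit:
    id: str
    hsps: List[Hsp] = field(default_factory=list)
    # Hit-level e-value; None when the parsed format does not provide it.
    evalue: Optional[float] = None


def hit_table(hits, policy="omit"):
    """Return (hit id, e-value) rows for a summary table, or None to skip it."""
    # If every hit already carries a hit-level e-value, no policy is needed.
    if all(hit.evalue is not None for hit in hits):
        return [(hit.id, hit.evalue) for hit in hits]
    if policy == "require":
        # Option 1: demand that the user supply the values.
        missing = [hit.id for hit in hits if hit.evalue is None]
        raise ValueError("hit-level e-values missing for: " + ", ".join(missing))
    if policy == "omit":
        # Option 2: leave out the whole hit-level table.
        return None
    if policy == "fill":
        # Option 3 (assumption): stand in with the smallest HSP e-value,
        # or None for a hit that has no HSPs at all.
        rows = []
        for hit in hits:
            value = hit.evalue
            if value is None:
                value = min((hsp.evalue for hsp in hit.hsps), default=None)
            rows.append((hit.id, value))
        return rows
    raise ValueError("unknown policy: %r" % (policy,))
```

A writer built on this would call hit_table(hits, policy="fill") and
skip the table whenever the result is None.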