[Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?

Michiel de Hoon mjldehoon at yahoo.com
Sun Sep 16 16:24:36 UTC 2012


Hi Bow,

Thanks for the links! This is actually the first time I have looked at the SearchIO module in detail.

I noticed that there is a large overlap between the functionality in Bio.Blast and the SearchIO module. We should definitely avoid having two sets of Blast parsers; as the recent discussion shows, one set of Blast parsers is hard enough already.

So I would strongly suggest integrating the SearchIO module with Bio.Blast. Here "integrate" could mean as little as using the Bio.Blast namespace and making sure we don't lose any functionality. (Or we could pick a better name than Bio.Blast, since SearchIO also includes blat, exonerate, etc.; but since Blast is the most important one, perhaps using Bio.Blast for all of them is OK.) The final outcome would then be that the parsers currently in SearchIO replace the parsers currently in Bio.Blast.

Also I noticed that SearchIO (like Bio.Blast) uses attributes to store information. I would much rather see a dictionary-like interface. This has the advantage that we can keep the key name much closer to what is in the original file (for example, no need to replace '-' by '_'), and also users can call .keys() to find out what is stored in the object.
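
To make the dictionary-like idea concrete, here is a minimal sketch. The HitRecord class is hypothetical and is not part of Bio.Blast or SearchIO; the keys follow BLAST XML tag names, and the values are made up for illustration.

```python
from collections.abc import Mapping


class HitRecord(Mapping):
    """Hypothetical read-only record; not an existing Biopython class.

    Field names are kept exactly as they appear in the file, so a key
    such as 'Hsp_bit-score' needs no '-' to '_' translation.
    """

    def __init__(self, fields):
        # Copy the parsed fields so later changes to the input do not leak in.
        self._fields = dict(fields)

    def __getitem__(self, key):
        return self._fields[key]

    def __iter__(self):
        return iter(self._fields)

    def __len__(self):
        return len(self._fields)


# Made-up values, keyed by BLAST XML tag names.
hit = HitRecord({
    "Hit_id": "gi|12345",
    "Hsp_bit-score": 98.6,
    "Hsp_evalue": 1e-20,
})

print(list(hit.keys()))      # discover what is stored in the object
print(hit["Hsp_bit-score"])  # key matches the name in the original file
```

Subclassing Mapping gives .keys(), .items(), .get() and the "in" operator for free, which is one possible way to get the discoverability described above without writing each method by hand.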

Best,
-Michiel.


--- On Sun, 9/16/12, Wibowo Arindrarto <w.arindrarto at gmail.com> wrote:

> From: Wibowo Arindrarto <w.arindrarto at gmail.com>
> Subject: Re: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
> To: "Michiel de Hoon" <mjldehoon at yahoo.com>
> Cc: "BioPython Mailing List" <biopython at lists.open-bio.org>
> Date: Sunday, September 16, 2012, 10:21 AM
> Hi Michiel,
> 
> We have a draft tutorial that I'm temporarily hosting here:
> http://bow.web.id/biopython/Tutorial.html#htoc96. The internal
> functions have also been documented with docstrings and quick examples
> (e.g. https://github.com/bow/biopython/blob/searchio/Bio/SearchIO/__init__.py).
> 
> At the moment, the SearchIO API is very similar to SeqIO and AlignIO,
> though in the future this is still subject to change.
> 
> Hope this helps :), otherwise let me know which part is specifically
> unclear for you.
> 
> regards,
> Bow
> 
> On Sun, Sep 16, 2012 at 3:54 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> > Hi Bow,
> >
> > Is there some documentation somewhere for the SearchIO module? I have
> > a hard time understanding what it does and how it relates to Blast.
> >
> > Thanks,
> > -Michiel.
> >
> > --- On Sat, 9/15/12, Wibowo Arindrarto <w.arindrarto at gmail.com> wrote:
> >
> >> From: Wibowo Arindrarto <w.arindrarto at gmail.com>
> >> Subject: Re: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
> >> To: "BioPython Mailing List" <biopython at lists.open-bio.org>
> >> Date: Saturday, September 15, 2012, 9:22 AM
> >> Hi guys,
> >>
> >> > > 2) If we add a function to Biopython that generates Blast plain-text
> >> > > output (or something close to it) from Blast XML output, then a user can
> >> > > generate the Blast output in XML format, parse it with Biopython,
> >> > > optionally filter it, and then generate the corresponding plain-text
> >> > > output;
> >> >
> >> > The new 'SearchIO' results objects str/repr should be familiar to
> >> > anyone who has looked at the plain text BLAST output - but
> >> > not identical. We could apply some of these improvements
> >> > to the current BLAST parsers, but I favour aiming to simply
> >> > deprecate them in favour of 'SearchIO' (namespace to be
> >> > decided).
> >> >
> >> > However, we certainly could try and offer a plain-text BLAST
> >> > output format from 'SearchIO', although IIRC Bow has not tried
> >> > that yet. It shouldn't be too complicated - unless you aim for
> >> > 100% agreement with the latest BLAST output (moving target).
> >>
> >> Yes, this has not been attempted ~ mostly because I feel that the
> >> BLAST plain text is indeed a moving target. But, if we are in favor of
> >> choosing one format from one BLAST version and always stick to it, it
> >> sounds more reasonable.
> >>
> >> There is one missing detail that is only present in the plain text
> >> format, though: the hit-level e-values. If we do decide to write a
> >> plain text writer, we either have to demand the user supply these
> >> values, or we omit the entire hit-level e-value table, or we fill it
> >> with something else.
> >>
> >> > Another idea we touched on was deprecating the current old,
> >> > complex but flexible plain text parser while adding a new simpler
> >> > plain text parser as part of 'SearchIO'. Here we could target only
> >> > the recent BLAST+ output (and perhaps if not so different the
> >> > final 'legacy' BLAST release), and not worry about all the variants
> >> > the NCBI have produced over the years. I would hope this would
> >> > also be faster [especially as currently 'SearchIO' supports parsing
> >> > plain text BLAST on top of the existing old parser].
> >>
> >> This wasn't attempted as well, mostly because I feel that a lot of
> >> people still use legacy BLAST (we've had more legacy-BLAST related
> >> emails rather than BLAST+ ones in the past few months, I think). Also,
> >> the current parser wins on flexibility. I think the test cases include
> >> BLAST versions from 2002 (10 years ago!) up to BLAST 2.2.25+. So like
> >> Peter mentioned, the current SearchIO BLAST plain text parser is
> >> actually a simple wrapper over Bio.Blast.NCBIStandalone.
> >>
> >> We might be able to create a newer, speedier parser, but making it as
> >> flexible as our current one seems difficult.
> >>
> >> regards,
> >> Bow
> >> _______________________________________________
> >> Biopython mailing list  -  Biopython at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/biopython
> >>
> 



