[Bioperl-l] RFC: Bio::SearchIO and Bio::Search::* objects

Thomas Down td2@sanger.ac.uk
Sun, 21 Oct 2001 13:58:03 +0100


On Sun, Oct 21, 2001 at 11:36:06AM +0100, Ewan Birney wrote:
> I think there will be a real benefit in having an underlying SAX like
> system to the parsers - like BioJava. I would imagine it like this
>
> [snip].
>
> The "dumb" users use the top level SearchIO system which makes a format
> event generator and attaches the SearchResultEventBuilder to give back an
> in memory SearchResult object
> 
> However, power users can use the event generating system to build much
> quicker parsers which ignore a lot of the information or map the results to
> new objects
> 
> I wish we had structured the SeqIO system this way - easy use for first
> time users with extensibility for more serious users.
> 
> More modules to write, but should fit well in the XML parsing end of the
> world.
> 
> 
> I've cc'd thomas down to get his opinion on this - thomas - what do you
> suggest?

I'd say go for it.  Event-driven schemes are nice: not only do
they allow power users to extract subsets and summaries of the
information quickly and efficiently, they also open the
possibility of inserting `transducers' in the pipeline to rewrite
the information in some way (BioJava uses transducers to help
parse some file formats).
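
To make the transducer idea concrete, here is a minimal sketch in
Python (not BioJava or BioPerl code).  The listener interface, the
class names and the one-line-per-HSP "name score" input format are
all made up for illustration.  A transducer is just a listener that
forwards events to another listener, rewriting or dropping some on
the way:

```python
class HitListener:
    """Hypothetical listener interface: one callback per parser event."""

    def start_hit(self, name):
        pass

    def hsp(self, score):
        pass

    def end_hit(self):
        pass


class ScoreFilter(HitListener):
    """Transducer: forwards events downstream, dropping low-scoring HSPs."""

    def __init__(self, downstream, min_score):
        self.downstream = downstream
        self.min_score = min_score

    def start_hit(self, name):
        self.downstream.start_hit(name)

    def hsp(self, score):
        if score >= self.min_score:
            self.downstream.hsp(score)

    def end_hit(self):
        self.downstream.end_hit()


class BestScoreSummary(HitListener):
    """'Power user' listener: keeps only the best score seen per hit."""

    def __init__(self):
        self.best = {}
        self._current = None

    def start_hit(self, name):
        self._current = name

    def hsp(self, score):
        previous = self.best.get(self._current)
        if previous is None or score > previous:
            self.best[self._current] = score

    def end_hit(self):
        self._current = None


def generate_events(lines, listener):
    """Toy event generator for the made-up 'name score' line format."""
    current = None
    for line in lines:
        if not line.strip():
            continue
        name, score = line.split()
        if name != current:
            if current is not None:
                listener.end_hit()
            listener.start_hit(name)
            current = name
        listener.hsp(float(score))
    if current is not None:
        listener.end_hit()


summary = BestScoreSummary()
pipeline = ScoreFilter(summary, min_score=40)
generate_events(["hitA 50", "hitA 120", "hitB 10"], pipeline)
print(summary.best)
```

The event generator never needs to know whether it is talking to the
filter, the summary, or a full object builder, which is what makes
the stages swappable.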

We've been using two different event-driven parsing systems in
BioJava for quite some time:

  - Sequence IO (a.k.a. newio in BioJava 1.1):

      This uses a SeqIOListener interface which was in some ways
      inspired by SAX, but dedicated to sequence handling.

  - Blast (and other apps) parsing, designed by the CAT people:

      This actually uses SAX interfaces.  The parsers implicitly
      transform BLAST output into XML.  With a simple SAX -> XML
      dumper, you can actually see this XML if you want.
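
As a rough illustration of the second approach, the sketch below
uses Python's standard xml.sax module rather than the BioJava
classes.  The element names (SearchResult, Hit, Score) and the
"name score" input lines are assumptions invented for the example;
they are not the schema the BioJava parsers emit and not real BLAST
output.  The parser only calls SAX handler methods, and
XMLGenerator from the standard library plays the part of the
SAX -> XML dumper:

```python
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl


def parse_hits(lines, handler):
    """Emit SAX events for a made-up 'name score' flat-file format."""
    handler.startDocument()
    handler.startElement("SearchResult", AttributesImpl({}))
    for line in lines:
        if not line.strip():
            continue
        name, score = line.split()
        handler.startElement("Hit", AttributesImpl({"name": name}))
        handler.startElement("Score", AttributesImpl({}))
        handler.characters(score)
        handler.endElement("Score")
        handler.endElement("Hit")
    handler.endElement("SearchResult")
    handler.endDocument()


def dump_xml(lines):
    """Attach the SAX -> XML dumper and return the implied XML as text."""
    out = io.StringIO()
    parse_hits(lines, XMLGenerator(out, encoding="utf-8"))
    return out.getvalue()


xml_text = dump_xml(["hitA 120", "hitB 42"])
print(xml_text)
```

No XML file ever exists on disk here; the XML only becomes visible
because the handler happens to be a dumper.  Any other SAX handler
could be attached in its place.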

The `dedicated' interface route is maybe a little bit easier
to get into (unless you're already a SAX-head), but the SAX
route is potentially more flexible (in terms of being able to
extend the abstract schema of the XML you're generating without
breaking stuff).  The SAX route also has the possibility of writing
parsers for new, XML-based, formats as simple XSLT transforms
into your abstract schema -- no coding required!

But either approach seems to work well.

One issue is that you probably do need to write the event stream ->
object model code fairly early on.  For a while BioJava had a lovely
BLAST parser, but a lot of people were being put off by the fact that
they couldn't see the results as objects.
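
A minimal sketch of what such event stream -> object model code can
look like, again using Python's standard xml.sax module.  The Hit
class, the HitBuilder handler and the element names are hypothetical,
chosen only for the example; this is not the BioJava or BioPerl
object model:

```python
import xml.sax
from dataclasses import dataclass, field


@dataclass
class Hit:
    """Hypothetical in-memory object for one search hit."""
    name: str
    scores: list = field(default_factory=list)


class HitBuilder(xml.sax.ContentHandler):
    """Builds Hit objects from a stream of SAX events."""

    def __init__(self):
        super().__init__()
        self.hits = []
        self._text = None  # collects character data while inside <Score>

    def startElement(self, name, attrs):
        if name == "Hit":
            self.hits.append(Hit(attrs["name"]))
        elif name == "Score":
            self._text = []

    def characters(self, content):
        if self._text is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "Score":
            self.hits[-1].scores.append(float("".join(self._text)))
            self._text = None


# Any SAX event source would do; a literal XML string keeps this short.
events = (
    b'<SearchResult><Hit name="hitA">'
    b"<Score>120</Score><Score>50</Score>"
    b"</Hit></SearchResult>"
)
builder = HitBuilder()
xml.sax.parseString(events, builder)
print(builder.hits)
```

Users who just want objects attach the builder and read builder.hits;
users who want speed attach their own handler and skip the objects
entirely.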

[In the last week, there has been a new suggestion in BioJava-land:
using a lex-like parser generator -- JFlex in this case -- to
generate a lot of the code of a flatfile -> SAX parser.  Quite an
impressive example of a BLAST parser implemented this way was
posted a few days back.  Search the BioJava mailing list archives
for LSAX if you're interested.]


Hope this helps,

    Thomas.