[Bioperl-l] GO parsers & event driven parsing framework

Matthew Pocock matthew_pocock@yahoo.co.uk
Wed, 15 May 2002 22:44:21 +0100


Hi Chris,

Good luck with this. Anything that makes parsing streams of stuff easier 
is good IMHO.

The biojava tag-value parser does a very similar job to this. Briefly, 
it has start/endRecord, start/endTag and value methods. The whole 
document is wrapped in a start/endRecord pair. Within a record, any 
number of start/endTag pairs can be fired. Within each tag pair, any 
number of values can be fired (e.g. one per list element). If there is 
sub-structure, a start/endRecord pair may also be fired within a tag 
scope. That nested record can then contain its own tag events, with 
value or record events, and so on.
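
To make the nesting rules concrete, here is a minimal sketch of that event 
model. The interface, class names and tag names (KW, FT) are assumptions 
made up for illustration; only the five method names come from the 
description above, and this is not the biojava API itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: fires the events for one record by hand.
class TagValueEventsDemo {
    public static void main(String[] args) {
        TraceListener trace = new TraceListener();
        fireExample(trace);
        for (String line : trace.lines) {
            System.out.println(line);
        }
    }

    // Stands in for a file parser: one record holding a two-value tag
    // and a tag whose content is a nested sub-record.
    static void fireExample(TagValueListener l) {
        l.startRecord();
        l.startTag("KW");
        l.value("kinase");
        l.value("transferase");
        l.endTag();
        l.startTag("FT");
        l.startRecord();          // sub-structure inside the FT tag scope
        l.startTag("type");
        l.value("mRNA");
        l.endTag();
        l.endRecord();
        l.endTag();
        l.endRecord();
    }
}

// Assumed listener shape, modelled on the five methods named above.
interface TagValueListener {
    void startRecord();
    void endRecord();
    void startTag(String tag);
    void endTag();
    void value(String value);
}

// Records each event as an indented line so the nesting is visible.
class TraceListener implements TagValueListener {
    final List<String> lines = new ArrayList<>();
    private int depth = 0;

    private void emit(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        lines.add(sb.append(text).toString());
    }

    public void startRecord() { emit("startRecord"); depth++; }
    public void endRecord() { depth--; emit("endRecord"); }
    public void startTag(String tag) { emit("startTag " + tag); depth++; }
    public void endTag() { depth--; emit("endTag"); }
    public void value(String value) { emit("value " + value); }
}
```

Running main prints one line per event, indented by how deeply the event 
is nested.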

There is a standard listener that consumes these events and builds a tree 
of Annotation objects. The tag-value stream is seen as just that - a 
stream of observations. The static representation is totally decoupled (a 
tree of Annotation objects being one possible representation).
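
As a rough sketch of such a tree-building listener, the code below turns 
the event stream into nested maps. Plain Map and List objects stand in for 
the Annotation tree, and every class name here is hypothetical; it only 
shows how a stack of open records and open tags is enough to rebuild the 
structure.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class: feeds a hand-written event stream to the builder.
class TreeBuilderDemo {
    public static void main(String[] args) {
        TreeBuilder builder = new TreeBuilder();
        fireExample(builder);
        System.out.println(builder.getRoot());
    }

    static void fireExample(TagValueListener l) {
        l.startRecord();
        l.startTag("AC"); l.value("AFnnnnnn"); l.endTag();
        l.startTag("FT");
        l.startRecord();
        l.startTag("type"); l.value("mRNA"); l.endTag();
        l.endRecord();
        l.endTag();
        l.endRecord();
    }
}

// Assumed listener shape (same five methods as described above).
interface TagValueListener {
    void startRecord();
    void endRecord();
    void startTag(String tag);
    void endTag();
    void value(String value);
}

// Builds one map per record: tag -> list of values, where a value is
// either a String or a nested record map.
class TreeBuilder implements TagValueListener {
    private final Deque<Map<String, List<Object>>> records = new ArrayDeque<>();
    private final Deque<String> tags = new ArrayDeque<>();
    private Map<String, List<Object>> root;

    public void startRecord() {
        records.push(new LinkedHashMap<String, List<Object>>());
    }

    public void endRecord() {
        Map<String, List<Object>> done = records.pop();
        if (records.isEmpty()) {
            root = done;              // outermost record is the result
        } else {
            addToCurrentTag(done);    // nested record becomes a value
        }
    }

    public void startTag(String tag) {
        tags.push(tag);
        records.peek().computeIfAbsent(tag, k -> new ArrayList<>());
    }

    public void endTag() {
        tags.pop();
    }

    public void value(String value) {
        addToCurrentTag(value);
    }

    private void addToCurrentTag(Object item) {
        records.peek().get(tags.peek()).add(item);
    }

    Map<String, List<Object>> getRoot() {
        return root;
    }
}
```

Because the builder is just another listener, swapping it for one that 
writes XML or database rows would not touch the parser.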

There are standard file parsers which emit these events for embl- and 
genbank-like formats.

We have a few useful listeners that forward events down a chain but 
mutate or intercept the events (e.g. split a single value into multiple 
values by splitting on comma, drop all sub-trees under a given tag, or 
replace values with newly built objects).
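
A chain like that might look as follows. This is a sketch under assumed 
names (Forwarder, CommaSplitter, SubtreeDropper and EventLog are all 
invented here, as are the KW and XX tags), showing two of the filters 
mentioned above stacked in front of a simple sink.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: builds a two-stage chain and pushes events in.
class FilterChainDemo {
    public static void main(String[] args) {
        EventLog log = new EventLog();
        fireExample(buildChain(log));
        System.out.println(log.events);
    }

    // Events pass through the dropper first, then the splitter, then the sink.
    static TagValueListener buildChain(TagValueListener sink) {
        return new SubtreeDropper("XX", new CommaSplitter(sink));
    }

    static void fireExample(TagValueListener l) {
        l.startRecord();
        l.startTag("KW"); l.value("kinase, transferase"); l.endTag();
        l.startTag("XX"); l.value("ignored"); l.endTag();
        l.endRecord();
    }
}

// Assumed listener shape (same five methods as described above).
interface TagValueListener {
    void startRecord();
    void endRecord();
    void startTag(String tag);
    void endTag();
    void value(String value);
}

// Base filter: passes every event unchanged to the next listener.
class Forwarder implements TagValueListener {
    protected final TagValueListener next;
    Forwarder(TagValueListener next) { this.next = next; }
    public void startRecord() { next.startRecord(); }
    public void endRecord() { next.endRecord(); }
    public void startTag(String tag) { next.startTag(tag); }
    public void endTag() { next.endTag(); }
    public void value(String value) { next.value(value); }
}

// Splits one comma-separated value into several value events.
class CommaSplitter extends Forwarder {
    CommaSplitter(TagValueListener next) { super(next); }
    @Override public void value(String value) {
        for (String part : value.split(",")) {
            next.value(part.trim());
        }
    }
}

// Swallows a given tag and everything nested inside it.
class SubtreeDropper extends Forwarder {
    private final String dropTag;
    private int skipDepth = 0;   // > 0 while inside the dropped tag

    SubtreeDropper(String dropTag, TagValueListener next) {
        super(next);
        this.dropTag = dropTag;
    }
    @Override public void startTag(String tag) {
        if (skipDepth > 0) { skipDepth++; return; }
        if (tag.equals(dropTag)) { skipDepth = 1; return; }
        next.startTag(tag);
    }
    @Override public void endTag() {
        if (skipDepth > 0) { skipDepth--; return; }
        next.endTag();
    }
    @Override public void startRecord() { if (skipDepth == 0) next.startRecord(); }
    @Override public void endRecord() { if (skipDepth == 0) next.endRecord(); }
    @Override public void value(String value) { if (skipDepth == 0) next.value(value); }
}

// Sink that records the events it finally receives.
class EventLog implements TagValueListener {
    final List<String> events = new ArrayList<>();
    public void startRecord() { events.add("startRecord"); }
    public void endRecord() { events.add("endRecord"); }
    public void startTag(String tag) { events.add("startTag " + tag); }
    public void endTag() { events.add("endTag"); }
    public void value(String value) { events.add("value " + value); }
}
```

Each filter only knows about the next listener, so filters can be 
reordered or reused without changing the parser or the final consumer.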

I think it should translate quite cleanly to perl, if this is the sort 
of thing you want. Regular expression transducing listeners make 
short work of most of the mess we encounter daily.
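
For a flavour of what a regular expression transducing listener could do, 
here is a sketch that rewrites a value such as "100..200" into a sub-record 
with from/to tags. The RangeExpander class, the pattern and the tag names 
are assumptions chosen for illustration, not taken from biojava.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: pushes one location tag through the transducer.
class RegexTransducerDemo {
    public static void main(String[] args) {
        EventLog log = new EventLog();
        TagValueListener chain = new RangeExpander(log);
        chain.startTag("location");
        chain.value("100..200");
        chain.endTag();
        System.out.println(log.events);
    }
}

// Assumed listener shape (same five methods as described above).
interface TagValueListener {
    void startRecord();
    void endRecord();
    void startTag(String tag);
    void endTag();
    void value(String value);
}

// Turns values of the form "N..M" into a nested record with from/to tags;
// every other event is forwarded unchanged.
class RangeExpander implements TagValueListener {
    private static final Pattern RANGE = Pattern.compile("(\\d+)\\.\\.(\\d+)");
    private final TagValueListener next;

    RangeExpander(TagValueListener next) { this.next = next; }

    public void startRecord() { next.startRecord(); }
    public void endRecord() { next.endRecord(); }
    public void startTag(String tag) { next.startTag(tag); }
    public void endTag() { next.endTag(); }

    public void value(String value) {
        Matcher m = RANGE.matcher(value);
        if (!m.matches()) {
            next.value(value);   // not a range: pass through
            return;
        }
        next.startRecord();
        next.startTag("from"); next.value(m.group(1)); next.endTag();
        next.startTag("to"); next.value(m.group(2)); next.endTag();
        next.endRecord();
    }
}

// Sink that records the events it finally receives.
class EventLog implements TagValueListener {
    final List<String> events = new ArrayList<>();
    public void startRecord() { events.add("startRecord"); }
    public void endRecord() { events.add("endRecord"); }
    public void startTag(String tag) { events.add("startTag " + tag); }
    public void endTag() { events.add("endTag"); }
    public void value(String value) { events.add("value " + value); }
}
```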

Matthew

Chris Mungall wrote:
> I'm rewriting the GO-text parsers (currently part of the GO software
> toolkit on sourceforge) and will probably commit these to bioperl.
> 
> I'm using a lightweight event driven framework, and obviously it would
> make sense to use the same framework for any rewrite of the current SeqIO
> parsers. I'll outline my method, and will happily change it to fit into
> another framework if anyone has any suggestions.
> 
> Initially I'd like to check in just the event-driven parsing part, and
> think about graph/ontology object models later.
> 
> The GO-text parser (and other parsers following the framework) generate
> nested events. I'm using the term "event" to be SAX friendly, but really
> these are just trees.
> 
> Let's take an imaginary GO graph; GO-style structured controlled
> vocabularies (see ftp://ftp.geneontology.org/pub/go/ontology) are often
> stored in the gotext flatfile format. This is a somewhat ad-hoc format
> that uses indentation to represent graphs as trees; multiple parentage is
> either represented as duplicate subtrees or in the detail line.
> 
> $Gene_ontology ; GO:0000001
>  %function ; GO:0000002
>   %enzyme ; GO:0000003
> 
> Other more robust formats are possible (n3, rdf, oil, daml, etc) but
> gotext is already prevalent and it is important to be able to parse it.
> You get used to it after a while - the above means "enzyme" is a subtype
> of "function" is a subtype of the "Gene_ontology" general type.
> 
> The GoText parser would zip through this, firing nested events in the
> following structure:
> 
> [go
>   [term
>     [name 'function']
>     [acc  'GO:0000002']
>   ]
>   [term
>     [name 'enzyme']
>     [acc  'GO:0000003']
>     [relationship
>       [type isa]
>       [obj 'GO:0000002']
>     ]
>   ]
> ]
> 
> e.g.  the parsing code starts off by calling
> 
> $self->start_event("go")
> 
> when a new term is encountered, it fires this
> 
> $self->event("name", $name)
> 
> at the end it says
> 
> $self->end_event("go")
> 
> the event handler can be overridden to intercept any of these; the default
> handler will just catch and nest all the events and return a tree as
> above.
> 
> A similar event tree for an EMBL record may look something like this:
> 
> [embl-set
>   [entry
>     [locus Blah]
>     [accession AFnnnnnn]
>     [ftable
>       [feature
>         [type mRNA]
>         [location
>           [from 100]
>           [to 200]
>         ]
>         [location
>           [from 300]
>           [to 400]
>         ]
>         [strain blah]
>         [product shuggy]
> ...
> 
> I feel strongly that the default event mechanism should be as lightweight
> as possible, using nested perl array references. Of course, one could
> easily swap in a lightweight xml element generator, and then use xslt on
> top of that. The above trees would just be attribute-less xml. We could
> use dtds/xmlschema to describe the various formats at the event level, but
> they would have to be attribute-less (this is consistent with ncbi xml
> but not the biojava way of doing things).
> 
> I should be upfront about my bias here - xml and all its associated
> bloated technology is just lisp recapitulated in a shockingly bad way. I'm
> strongly against introducing dependencies on any unnecessary xml
> constructs for this framework. Of course, like it or not, xml and xslt are
> important and well-supported technologies, so the framework should play
> well with them, it just shouldn't be dependent on them.
> 
> Assuming I haven't alienated everyone already, what will the namespaces
> look like?
> 
> Of course, the interface to SeqIO et al should remain exactly the same. If
> you want objects from a file, you don't care one way or another whether
> the underlying framework is event driven or not.
> 
> So, how about something like this:
> 
> Bio/
>         Handlers/
>                 HandlerI
>                 BaseHandler
>                 XmlOutHandler
>         SeqIO/
>                 EMBL
>                 GenBank
>                 Swiss
>                 Parsers/
>                         EMBLParser
>                         GenBankParser
>                         SwissParser
>                 Handlers/
>                         EMBL2GenericSeqRecord
>                         GenBank2GenericSeqRecord
>                         GenericSeqRecord2Obj
>                         GenericSeqRecord2Fasta
>         OntologyIO/
>                 GoText
>                 Parsers/
>                         GoTextParser
>                         GoRdfParser
>                 Handlers/
>                         Go2Obj
>         SearchIO/
>                 Parsers/
>                         Blast
>                         Blastxml
> 
> The format-specific handler classes inherit from BaseHandler, and turn the
> events into objects. Essentially they do the same thing as xslt, but in
> perl. perl can do a reasonable job of impersonating lisp, so this is ok.
> Or you can just get the parser to generate xml elements and use an actual
> xslt transformer. Or mix and match the bio* code.
> 
> bioperl-db could have its own handlers for turning events straight into
> insert statements.
> 
> Ok, maybe the best thing for me to do is commit the GO side of things and
> let you all decide for yourself what the various merits etc are?
> 
> --
> Chris