Bioperl: EMBL/GenBank parsing

Ewan Birney birney@ebi.ac.uk
Mon, 8 May 2000 09:18:49 +0100 (BST)


Both EMBL and GenBank parsing are becoming more important in 
bioperl - and more frustrating to deal with. The problem hinges on
the fact that the format is very ill-defined, and trying to have
a parser that can represent everything in GenBank/EMBL nicely both
(a) puts a straitjacket on the object model and (b) ignores the
fact that certain people want to interpret the same EMBL/GenBank
file differently (doh!).

I think we should have the following criteria for the format and
parsing:

- The format should not dominate the object model design

- Where necessary, objects should be extended to accommodate some 
  of the more standard points in EMBL/GenBank, but the code should 
  not assume that objects have these extensions; it should check
  with the ->can syntax and test for undef

- Minor points in the format can be ignored if deemed to be too much
  to put in (see the next point)

- Users can define their own object handlers to get their own
  interpretation out.

- The user should be in control of whether the object parser
	- throws an exception on finding a format problem
	- issues a warning but still attempts to build an object
	- issues a warning but skips to the next entry
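
To make the ->can criterion concrete, here is a minimal sketch. The
class names (My::PlainSeq, My::RichSeq), the optional 'division' method
and the 'unknown' default are all made up for illustration; none of
them is an existing bioperl interface.

```perl
use strict;
use warnings;

# Hypothetical classes for illustration only; not real bioperl modules.
package My::PlainSeq;
sub new { my ($class, %args) = @_; return bless {%args}, $class }
sub id  { return $_[0]->{id} }

package My::RichSeq;
our @ISA = ('My::PlainSeq');
# Optional extension; returns undef when the entry carried no division.
sub division { return $_[0]->{division} }

package main;

# Return a printable division for any sequence object, whether or not
# it implements the optional 'division' method.
sub division_or_default {
    my ($seq) = @_;
    return 'unknown' unless $seq->can('division');   # object lacks the extension
    my $div = $seq->division;
    return defined $div ? $div : 'unknown';          # extension present, value unset
}

my @seqs = (
    My::PlainSeq->new(id => 'A'),
    My::RichSeq->new(id => 'B', division => 'HUM'),
    My::RichSeq->new(id => 'C'),
);
printf "%s\t%s\n", $_->id, division_or_default($_) for @seqs;
```

The two checks cover different cases: ->can catches an object that was
never extended, and the undef test catches an extended object whose
value was simply not set.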


This suggests quite an overhaul of parts of the parsing. 


I would like to suggest that we have the following set up:


- a common base class for EMBL/GenBank/Swissprot parsing

- specific classes for each format, which *only* handle the parsing of
  the format, in a producer/consumer manner. The parsers essentially
  provide objects which are
	(tag1,tag2,@lines) with

		tag1 being ID, CC etc
		tag2 being empty for everything but the Feature Table, 
			where it is the key

- The common base class has hooks for handlers, held in hashes keyed
  on tag1/tag2 (if there is no handler for a specific tag2, default
  to the one for tag1).

- Another object, a ParserController, which has attributes like

	->throw_on_error
	->warn_on_error
	->skip_on_error
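
The set up above could be sketched roughly as follows. This is only one
possible reading of the proposal: the package name, the method names
(add_handler, handler_for, build_entry), the single 'policy' attribute
standing in for the three *_on_error flags, and the sample records are
all assumptions for illustration, not an existing bioperl API.

```perl
use strict;
use warnings;

# Hypothetical sketch; names and policy strings are assumptions.
package My::ParserController;

sub new {
    my ($class, %args) = @_;
    return bless { policy => $args{policy} || 'throw', handlers => {} }, $class;
}

# Register a handler for tag1, or for a tag1/tag2 pair when tag2 is given.
sub add_handler {
    my ($self, $code, $tag1, $tag2) = @_;
    my $key = defined $tag2 && length $tag2 ? "$tag1/$tag2" : $tag1;
    $self->{handlers}{$key} = $code;
    return $self;
}

# Find the most specific handler: tag1/tag2 first, then plain tag1.
sub handler_for {
    my ($self, $tag1, $tag2) = @_;
    my $h = $self->{handlers};
    if (defined $tag2 && length $tag2 && $h->{"$tag1/$tag2"}) {
        return $h->{"$tag1/$tag2"};
    }
    return $h->{$tag1};
}

# Consume one entry, given as a list of [tag1, tag2, @lines] records.
# Returns a hash reference, or undef if the entry was skipped.
sub build_entry {
    my ($self, @records) = @_;
    my %entry;
    for my $rec (@records) {
        my ($tag1, $tag2, @lines) = @$rec;
        my $handler = $self->handler_for($tag1, $tag2) or next;  # unhandled tags are ignored
        my $ok = eval { $handler->(\%entry, $tag1, $tag2, @lines); 1 };
        next if $ok;
        my $err = $@;
        die $err if $self->{policy} eq 'throw';   # throw an exception
        warn "format problem in $tag1: $err";
        return if $self->{policy} eq 'skip';      # warn, skip to the next entry
        # policy 'warn': carry on and build what we can
    }
    return \%entry;
}

package main;

my $ctl = My::ParserController->new(policy => 'warn');
$ctl->add_handler(sub { my ($e, $t1, $t2, @l) = @_; $e->{id} = $l[0] }, 'ID');
$ctl->add_handler(sub { my ($e, $t1, $t2, @l) = @_; push @{ $e->{features} }, [$t2, @l] }, 'FT');
$ctl->add_handler(sub { die "cannot parse CDS location\n" }, 'FT', 'CDS');

# Made-up records in the (tag1,tag2,@lines) shape a format parser would emit.
my @records = (
    ['ID', '',       'EXAMPLE1'],
    ['FT', 'source', '1..1859'],
    ['FT', 'CDS',    'garbled'],
);
my $entry = $ctl->build_entry(@records);
```

Here the 'source' feature falls through to the generic FT handler,
while 'CDS' has its own handler, which fails on purpose to exercise the
error policy. With policy 'warn' the entry is still built from the
records that parsed; 'skip' would return nothing and 'throw' would die.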


This would suggest quite a rewrite of the parsers, but we would gain
in flexibility. I'd like to kick this proposal around for a while -
I am sure Keith and Hilmar will have requirements to be met for the
parsers and we want to make sure we don't make this too complicated
for its own good. Volunteers for the implementation also welcome ;)


The floor is open for people to discuss this. Let's try to get to a good
proposal in 2-3 weeks' time...


ewan


-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
<birney@ebi.ac.uk>. 
-----------------------------------------------------------------
