[BioRuby] Parsing line-based formats with Ragel

Artem Tarasov lomereiter at googlemail.com
Mon Jun 4 10:31:14 EDT 2012


On Mon, Jun 4, 2012 at 4:56 PM, Fields, Christopher J <cjfields at illinois.edu> wrote:

> > Also agree. Parsing is a common theme in Bio*. A state engine would
> > be a great abstraction, targeting C or D, and even the interpreted
> > languages. The SAM parser would be a great proof-of-concept. I am
> > also very interested to see how it will perform against samtools.
> >
> > The spanner in the works may be that we tend to be very sloppy about
> > standards. So relaxed parsers may also be needed.
>
> Either that, or use the grammar as a source of validation (e.g. if the
> parse fails, the data is not formatted correctly).  That's basically the
> tack I plan with perl 6 grammars.
>
> chris
>
>
Yes, I think the problem of invalid data can be addressed by adding
fallback rules with a less strict grammar. For instance, if the format is
tab-delimited, we can pin the problem down to a particular field by placing
fewer restrictions on its character set, like

invalidsomefield = [^\t]+ %some_error_action;
somefield = (bunch of rules conformant to spec) | invalidsomefield;

If we want more comprehensible error messages, [^\t]+ can be replaced by a
set of rules for the different kinds of invalid input.
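
A rough sketch of that idea in plain Ruby regexps rather than Ragel, just to
show the shape of the rules. The two-field record (NAME, POS), the patterns,
and the names FIELDS, INVALID_KINDS and check_line are all made up for
illustration and are not taken from any format spec:

```ruby
# Hypothetical two-field, tab-delimited record: NAME <tab> POS.
# The strict patterns stand in for "rules conformant to spec".
FIELDS = [
  { name: :name, strict: /\A[!-~]+\z/ },
  { name: :pos,  strict: /\A\d+\z/ }
].freeze

# More specific "invalid" rules, tried in order once the strict rule fails.
# The last entry plays the role of the catch-all [^\t]+ alternative.
INVALID_KINDS = [
  [/\A\s*\z/,    "is empty or blank"],
  [/\A-\d+\z/,   "is a negative number"],
  [/\A[^\t]+\z/, "does not match the spec"]
].freeze

# Returns a list of error messages; an empty list means the line is valid.
def check_line(line)
  errors = []
  values = line.chomp.split("\t", -1) # -1 keeps trailing empty fields
  FIELDS.each_with_index do |field, i|
    value = values[i]
    if value.nil?
      errors << "field #{i + 1} (#{field[:name]}) is missing"
      next
    end
    next if value.match?(field[:strict])
    # Pick the first invalid-input rule that describes this value.
    _, reason = INVALID_KINDS.find { |pattern, _| value.match?(pattern) }
    errors << "field #{i + 1} (#{field[:name]}) #{reason}: #{value.inspect}"
  end
  errors
end
```

Here check_line("read1\t-5") reports the second field as a negative number
instead of just saying that the line is malformed. In Ragel, each entry of
INVALID_KINDS would presumably become its own alternative with its own
error action.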

The big plus of state machines is that they don't scan the string multiple
times, as usually happens with a hand-written parser that performs several
checks in turn.
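
To make the single-pass point concrete, here is a small hand-written state
machine in Ruby, roughly the kind of loop a generated scanner boils down to.
The record layout (printable NAME, a tab, a numeric POS) and the function
name scan_record are assumptions for the example; code generated by Ragel
would be table- or goto-driven and look quite different:

```ruby
# Hypothetical record: NAME <tab> POS, where NAME is printable ASCII
# without spaces and POS is a run of digits. Every byte is looked at
# exactly once; validation and number conversion happen in the same pass.
# The caller is expected to strip the trailing newline first.
def scan_record(line)
  state = :name_start
  name_end = nil
  pos = 0
  line.each_byte.with_index do |b, i|
    case state
    when :name_start, :name
      if b == 9                     # tab ends the NAME field
        return nil if state == :name_start # empty NAME is invalid
        name_end = i
        state = :pos_start
      elsif b >= 33 && b <= 126     # printable, non-space ASCII
        state = :name
      else
        return nil
      end
    when :pos_start, :pos
      return nil unless b >= 48 && b <= 57 # '0'..'9'
      pos = pos * 10 + (b - 48)     # build the integer while scanning
      state = :pos
    end
  end
  # Only :pos is an accepting state (at least one digit was seen).
  state == :pos ? [line.byteslice(0, name_end), pos] : nil
end
```

A split-then-regexp-then-to_i version would walk over the same bytes three
times; here scan_record("read1\t100") gives ["read1", 100] after one pass,
and anything that does not fit the two fields gives nil.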

