[BioPython] scanner generator

Andrew Dalke dalke@bioreason.com
Wed, 06 Oct 1999 12:44:45 -0600


Jeff said:
> These problems inspired Andrew (I think) to write his UPDB program,
> which would generate scanners for the PDB format based on the PDB
> specification.

My problem was I had written so many parsers for the PDB format,
each specialized to a subset, that I wanted to get it over with.
However, in that case, the lexer is really quite easy -- look
at the first 6 characters.  It was the parser that was the problem.

> However, my contention is that bioinformatics text data formats
> are relatively "simple" from a language design point of view.

I agree, since most of the formats can be described in a regular
grammer.

> > you need a lexer which identifies that a line of text can be given a
> > label, so
> 
> This is where Andrew and I differ somewhat.  I believe lifting the
> restriction of basing the lexer on lines would make it more generally
> useful.

The description I'm trying to develop happens to be line-oriented,
but there shouldn't be any reason a non-line oriented implementation
couldn't use what I have as the basis.

But it is true, I'm not trying for a general solution, since I
think there are many generalized parser generators available
(eg, see http://python.org/download/Contributed.html#ProgrammingTools).
I want something which simplifies the common case, but which doesn't
preclude more complete solutions later.

> So I can immediately think of two ways to approach this.  1) You try
> out all the NFA's until only one of them matches.  2) You optimize
> the union NFA, combining similarities between the versions, and only
> execute the states relevant to the one you've detected.  But if none
> of them match, you may not know which version you've found unless
> you explicitly specify which token contains version information.

When I did this before with manually written parsers, I had a special
token called "__version", which contained the version data.  This
acts as an epsilon transition and won't get called until you have
determined the path through the NFA (gotten rid of the nondeterminism).

This also gives the callback the ability to stop the traversal at the
shortest possible point needed to determine uniqueness.

> It looks very relevant, with the caveat that everything is done
> in memory (please correct me if I'm wrong).  This may be a problem,
> as BLAST results nowadays are in the order of megabytes, and
> sequencing is speeding up.

There's still memory-mapped files.  But like I said, I haven't taken
a close look at mxTextTools, yet.

						Andrew Dalke
						dalke@acm.org