[BioPython] scanner generator

Jeffrey Chang jchang@SMI.Stanford.EDU
Wed, 6 Oct 1999 10:59:50 -0700 (PDT)


> I know he's been working on code for doing this, so I hope he'll
> chime in.

No problem!  ;)

Yes, bioinformatics is a very fertile field for experimenting with all
sorts of icky parsing programs.  There are many different text data
formats, and they're changing *constantly*.  Thus, there is a lot of
repeated work as everyone is creating little tiny bits of parsers that
grab just the fields that they're currently interested in.

This is one of the motivations behind bioperl.org -- to consolidate all
these little tools for the community.

However, this doesn't solve the maintenance problem.  The file formats
aren't stable; they constantly change to accomodate new types of data.
This creates a maintenance nightmare as you're trying to keep your parsers
up to date, but still able to parse old formats as necessary.

These problems inspired Andrew (I think) to write his UPDB program, which
would generate scanners for the PDB format based on the PDB specification.

Generalizing that idea, Andrew and I have been talking about creating
scanner generators that can take specifications of text files and create
generators to grab the relevant tokens.

As Andrew mentioned, Icarus does that for SRS.  However, my contention is
that bioinformatics text data formats are relatively "simple" from a
language design point of view.  There are very few constructs that can't
be pulled out with a scanner (as opposed to a parser).  Even the
structures that require a parser (e.g. phylogenetic trees) would benefit
from a scanner, with the simplest hand-crafted parser to maintain the
semantic meaning of the tokens.

Icarus is a full-blown parser generator.  Although I can see why they need
that (they need to preserve the semantics to do the cross linkings), I
believe that for general use, a scanner generator would retain 90% of the
functionality while giving big wins on the syntax.  Regular languages are
more familiar to people than BNF-type specifications.

Thus, I have written a scanner generator called "Forager".  It is
currently fully implemented and usable for scanning files.  I have created
a specification for SWISS-PROT r37 entries.  However, it's still pre-alpha
software, because I may make some API changes soon.  It is available under
the biopython license (like python, bsd-ish).

http://www.biopython.org/unsupported/Forager.tar.gz


> you need a lexer which identifies that a line of text can be given a
> label, so

This is where Andrew and I differ somewhat.  I believe lifting the
restriction of basing the lexer on lines would make it more generally
useful.  It's easier to specify the structures of data files that are not
line based, or not clearly lines based, such as ASN.1, XML, or prodoc.

However, Andrew has previously pointed out some difficulties such as:
- more complicated scanner for the lexer
- harder to optimize
- have to deal with the platform-dependent newline problem

Admittedly, this is more work, but Forager deals with them.



> 
> NR   /TOTAL=20(20); /POSITIVE=20(20); /UNKNOWN=0(0); /FALSE_POS=0(0);
> 
> is recognized as a "NR" line.  A more complicated lexer may even do
> some line-based parsing, like extracting the counts.

Yes, this is what Forager does.  You specify a regular expression that
recognizes this line and pulls out the counts (or whatever) as named
groups.

Incidentally, the regular expression library (re.py) is not robust enough
to do this.  There's a short, cryptic ;) discussion of it in the README
file in Forager.


> In real life I found it was nice to have additional nodes in the DFA;
> kinda like epsilon transitions. 

Yep!  That's exactly what happens in Forager!  So it's not exactly a DFA.


> The current way people identify the version is to have
> lots of little "if" statements in the code to recognize the different
> versions. This is a maintenence headache.

I agree.


> Instead, it's easy to build a version detector by building a NFA
> which is the union of all the known BLAST versions

Hmmm, this is an interesting idea.  I've been struggling with the question
of how to embed feature-type information.  So I can immediately think of
two ways to approach this.  1) You try out all the NFA's until only one of
them matches.  2) You optimize the union NFA, combining similarities
between the versions, and only execute the states relevant to the one
you've detected.  But if none of them match, you may not know which
version you've found unless you explicitly specify which token contains
version information.


> If you have an account on the bioperl/biopython/bioxml/biojava/whatever
> machine, it's in ~dalke/line_lexer.tar.gz . 

I've put this on the biopython web server as well.  It's available at:
http://www.biopython.org/unsupported/line_lexer.tar.gz.


> My current idea is to emit something as input to Marc-André Lemburg's
> mxTextTools
>   http://starship.skyport.net/~lemburg/mxTextTools.html

I looked at this.  It looks very relevant, with the caveat that everything
is done in memory (please correct me if I'm wrong).  This may be a
problem, as BLAST results nowadays are in the order of megabytes, and
sequencing is speeding up.

Jeff