[BioPython] scanner generator

Andrew Dalke dalke@bioreason.com
Wed, 06 Oct 1999 01:00:39 -0600


Now that I know Tim Peters reads this list, it's a good time to
bring up parsers <wink>.

Jeff Chang suggested to me a few months ago the idea of writing
scanner generators for the different file formats we have to deal
with, especially the "not designed for machine parsing" formats,
like BLAST output (hence his http://biopython.org/Projects/BlastTests/
project).

I know he's been working on code for doing this, so I hope he'll
chime in.  I wanted to mention some of my ideas on the topic, based
on his suggestions and some previous work I've done in parsing
files.

Most of the file formats we deal with (except XML, ASN.1 and a few
others, nearly all of which are designed to be machine parsable)
are line oriented.  The most precise way I can define "line oriented"
is that there is a one-to-one mapping between every line and some
sort of identifier for that line.

The easiest example is for something like the PROSITE and SWISSPROT
formats, like http://www.expasy.ch/cgi-bin/get-prosite-entry?PS00365
where they've already nicely labeled each line type.  However,
with a little thought it is straightforward to assign labels to
each of the lines of a BLAST result, like "version line", "hit list
header", "hit list element", etc.

So another way to think of one of these files is as a sequence of
identifiers.
Using PROSITE as an example, every record is of the form

ID + AC + DT + DE + PA* + MA* + RU* + NR* + CC* + DR* + 3D* + DO + "//"

where "*" means "0 or more" and "+" means "followed by".  A slightly
more complicated form might take into account that PA and MA records
do not appear in the same record, as in

ID + AC + DT + DE + (PA*|MA*) + RU* + NR* + CC* + DR* + 3D* + DO + "//"

I wrote these expressions this way to show that you can consider
them as regular expressions of a sort, only with line labels
instead of characters as the basic data type.  To get this, you
need a lexer which assigns each line of text a label, so

NR   /TOTAL=20(20); /POSITIVE=20(20); /UNKNOWN=0(0); /FALSE_POS=0(0);

is recognized as an "NR" line.  A more complicated lexer may even
do some line-based parsing, like extracting the counts.  (This
could happen if the lexer is based on the re module; you don't want
to throw the match information away if you can reuse the data
elsewhere.)
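
Here's a rough sketch of what I mean by a line lexer (the function
and pattern names are made up, and a real lexer would handle many
more line types):

import re

# Hypothetical line lexer: map a raw line to a (label, data) pair.
# Most lines are identified by their two-character code alone; NR
# lines get extra parsing so the counts can be reused downstream.
_nr_pattern = re.compile(r"NR   /TOTAL=(\d+)\((\d+)\);")

def lex_line(line):
    if line.startswith("//"):
        return "termination", None
    label = line[:2]
    if label == "NR":
        m = _nr_pattern.match(line)
        if m is not None:
            # keep the extracted TOTAL counts rather than discarding them
            return "NR", m.groups()
    return label, None

For the NR line above, lex_line returns ("NR", ("20", "20")).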

Given the lexer and the regular expression, it's straightforward to
create a DFA or NFA which uses the lexical tokens to drive the
transitions.
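
For instance, here's a table-driven DFA over line labels, reusing
lex_line from the sketch above and trimmed down to an
ID AC DT DE PA* MA* DO "//" skeleton of the first PROSITE
expression (the state names and table encoding are my own
invention):

# Hypothetical transition table: (current state, line label) -> next
# state.  Trimmed to ID AC DT DE PA* MA* DO "//" for brevity.
TRANSITIONS = {
    ("start", "ID"): "ID",
    ("ID", "AC"): "AC",
    ("AC", "DT"): "DT",
    ("DT", "DE"): "DE",
    ("DE", "PA"): "PA", ("PA", "PA"): "PA",    # PA*
    ("DE", "MA"): "MA", ("PA", "MA"): "MA",
    ("MA", "MA"): "MA",                        # MA*
    ("DE", "DO"): "DO", ("PA", "DO"): "DO",
    ("MA", "DO"): "DO",
    ("DO", "termination"): "done",
}

def check_record(lines):
    state = "start"
    for line in lines:
        label, data = lex_line(line)
        if (state, label) not in TRANSITIONS:
            raise ValueError("unexpected %s line in state %s"
                             % (label, state))
        state = TRANSITIONS[(state, label)]
    if state != "done":
        raise ValueError("record ended early in state %s" % state)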

What this gives you is an easy way to create parsers for a wide
range of uses.  Define the line lexer and the line orderings and
you can get a parser for that format.

In fact, you can have several different parsers for the same format.
For example, suppose you want to write the SWISSPROT to FASTA
converter.  It only needs to read the AC, DE and SQ lines -- it
can skip the others.  As another case, you may want to convert the
record into HTML and so want to parse each line to add hyperlinks; this
calls for a more complicated parser.

What's missing in my description is the "thing that does stuff."
I've only described the DFA and the tokens, but there needs to be
something to produce the FASTA file or the HTML output.  This is
a callback object which is used every time a new node is visited in
the DFA, and takes the line and the lexer data (if any) as parameters.

Actually, Ihab Awad <ihab@mail.ahc.umn.edu> pointed out during OiB
that in the Design Patterns book, this is the Visitor Pattern.
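
As a guess at what such a visitor could look like for the SWISSPROT
to FASTA case above (the on_* naming, the dispatch helper, and the
sequence-line label are all invented for illustration):

# Hypothetical dispatch: call the visitor method named after the
# line label, falling back to a do-nothing default for skipped lines.
def dispatch(visitor, label, line, data):
    method = getattr(visitor, "on_" + label, None)
    if method is not None:
        method(line, data)
    else:
        visitor.default(label, line, data)

class SwissProtToFasta:
    # Only cares about AC, DE and sequence lines; everything else
    # falls through to default() and is skipped.
    def __init__(self):
        self.ac, self.de, self.seq = None, [], []
    def on_AC(self, line, data):
        self.ac = line[5:].split(";")[0]
    def on_DE(self, line, data):
        self.de.append(line[5:].strip())
    def on_sequence_data(self, line, data):  # invented label for SQ body
        self.seq.append(line.strip().replace(" ", ""))
    def on_termination(self, line, data):
        print(">%s %s" % (self.ac, " ".join(self.de)))
        print("".join(self.seq))
    def default(self, label, line, data):
        pass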

BTW, I went into more detail about how the callback can be used in
  http://www.deja.com/getdoc.xp?AN=414575448&fmt=text
and Wayne Parrott gave a different interpretation in
  http://www.deja.com/getdoc.xp?AN=415015222&fmt=text

This is from about a year ago, before I considered the
possibility of creating "parse_fasta_record" from a description
and implementing it as a finite automaton.

That newsgroup example does bring up another concept.  In real life
I found it was nice to have additional nodes in the DFA; kinda like
epsilon transitions.  For example, parts of the BLAST output fit
well into a hierarchical framework; results are a set of matches,
matches are one or more alignments, alignments are one or more 
pairs of subsequences.  It was useful to add special statements
in the parser to indicate I had entered or left one of the layers
in the hierarchy.

This can be thought of as having the input lines be the leaf nodes
in a tree and the levels of the hierarchy be intermediate nodes in
the tree.  As a regular expression implementation, the latter is
something like creating a named grouping.
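
In callback terms, those intermediate nodes could surface as paired
begin/end events.  A sketch for the BLAST case (all the method
names are invented):

# Hypothetical begin/end events, emitted at the epsilon-like
# transitions into and out of each level of the BLAST hierarchy.
class BlastVisitor:
    def begin_match(self):
        self.alignments = []
    def begin_alignment(self):
        self.alignments.append(0)
    def subsequence_pair(self, line, data):
        # leaf event: one pair of aligned subsequences
        self.alignments[-1] += 1
    def end_alignment(self):
        pass
    def end_match(self):
        print("match: %d alignments" % len(self.alignments))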

A couple more things, mentioned in the newsgroup references.  This
style of doing things is very similar to that used in XML parsers,
since the XML parser visits the nodes of a tree and has the callback
element do "something" with the results.  It is also useful to have
some way to abort politely, so if the format is wrong the callback
can clean up, if needed.
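
One way to arrange the polite abort is to have the callback raise a
dedicated exception which the driver catches (ParserAbort and the
cleanup hook are my inventions; lex_line and dispatch are from the
sketches above):

class ParserAbort(Exception):
    # raised by a callback that wants to stop parsing, either because
    # the format is wrong or because it already has what it needs
    pass

def run(lines, visitor):
    try:
        for line in lines:
            label, data = lex_line(line)
            dispatch(visitor, label, line, data)
    except ParserAbort:
        cleanup = getattr(visitor, "cleanup", None)
        if cleanup is not None:
            cleanup()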

Finally, as another thought, this could be used to resolve the
version identification problem with every point release of FASTA
and BLAST.  The current way people identify the version is to have
lots of little "if" statements in the code to recognize the different
versions.  This is a maintenance headache.

Instead, it's easy to build a version detector by building an NFA
which is the union of all the known BLAST versions, and when the
NFA is no longer "N" (so there is only one possible version
identified) then it can stop, saying "Okay, this is NCBI BLAST
version 2.0.4".
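
A sketch of that idea, simulating the union NFA by running one
candidate machine per known version in parallel and dropping any
that reject (the machines table and the feed() interface are
assumptions):

# Hypothetical version detector: each candidate machine has a
# feed(line) method which returns false once the line stream can no
# longer be that version's output.
def detect_version(lines, machines):
    alive = machines.copy()  # e.g. {"NCBI BLAST 2.0.4": dfa, ...}
    for line in lines:
        for name in list(alive.keys()):
            if not alive[name].feed(line):
                del alive[name]
        if len(alive) == 1:
            return list(alive.keys())[0]
    raise ValueError("could not identify the version; %d candidates left"
                     % len(alive))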


One evening I hacked up a version which produces a PROSITE lexer
and a way to check the ordering (not a DFA/NFA; that approach was
suggested to me after talking with Jeff).  The PROSITE description
is all in Python syntax, as in

prosite = """

id = BEGIN("ID") + STRING("ID") + STRING("  ") + TO_END
ac = BEGIN("AC") + STRING("AC") + STRING("  ") + TO_END
dt = BEGIN("DT") + STRING("DT") + STRING("  ") + TO_END
de = BEGIN("DE") + STRING("DE") + STRING("  ") + TO_END
pa = BEGIN("PA") + STRING("PA") + STRING("  ") + TO_END
ma = BEGIN("MA") + STRING("MA") + STRING("  ") + TO_END
ru = BEGIN("RU") + STRING("RU") + STRING("  ") + TO_END
nr = BEGIN("NR") + STRING("NR") + STRING("  ") + TO_END
cc = BEGIN("CC") + STRING("CC") + STRING("  ") + TO_END
dr = BEGIN("DR") + STRING("DR") + STRING("  ") + TO_END
_3d = BEGIN("3D") + STRING("3D") + STRING("  ") + TO_END
do = BEGIN("DO") + STRING("DO") + STRING("  ") + TO_END
termination = BEGIN("termination") + STRING("//") + TO_END

exact_record = id + ac + dt + de + REPEAT(pa) + REPEAT(ma) + REPEAT(ru) + \
        REPEAT(nr) + REPEAT(cc) + REPEAT(dr) + REPEAT(_3d) + do + \
        termination

WRITE(exact_record, language="python", filename="prosite_lexer.py")
"""

This string is exec'ed to do everything.  The "BEGIN" sets the
label for the line, and BEGIN, STRING, and TO_END are objects of

class Field:
    # One term in a line description.  A term may be anchored to the
    # beginning or the end of a line, but not both.
    def __init__(self, has_begin=0, has_end=0):
        assert has_begin + has_end != 2, "can only specify one of begin or end"
        self._has_begin = has_begin
        self._has_end = has_end

    def __add__(self, other):
        # "+" concatenates; can only add two fields
        if not isinstance(other, Field):
            raise TypeError
        return FieldNode([self, other])

When a BEGIN and an END are added together, they create an instance
of a Line class, which uses the same tricks to create a parse tree
of lines.  Then "WRITE" takes a parse tree and uses it to create
the parser.
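
To make that snippet self-contained, here's a guess at the missing
FieldNode, plus how the description string might be exec'ed (the
real code in line_lexer.tar.gz surely differs):

# Hypothetical FieldNode: the sequence node Field.__add__ builds.
# Adding to a node extends its child list, so a + b + c stays a
# flat three-element sequence rather than nesting.
class FieldNode(Field):
    def __init__(self, children):
        Field.__init__(self)
        self.children = children
    def __add__(self, other):
        if not isinstance(other, Field):
            raise TypeError
        return FieldNode(self.children + [other])

# The description string is then run with the mini-language's names
# bound in a namespace, something like:
#     names = {"BEGIN": BEGIN, "STRING": STRING, "TO_END": TO_END,
#              "REPEAT": REPEAT, "WRITE": WRITE}
#     exec(prosite, names)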

I've actually got this working, though only with formats like
PROSITE which don't have "or"s.  If you have an account on
the bioperl/biopython/bioxml/biojava/whatever machine, it's
in ~dalke/line_lexer.tar.gz .

A cool thing about this approach is the back-end can support
various languages, so if someone wants to write a lex output driver
for C, or something for Perl, they can go ahead.  My current idea
is to emit something as input to Marc-André Lemburg's mxTextTools
  http://starship.skyport.net/~lemburg/mxTextTools.html
but I have to learn how that works, first.

  (Tim, you still paying attention?  :)

I'm planning on working on this project starting in mid-November.

Anyone have any comments?

Also, I know Icarus in SRS also does something like this (see
 http://srs6.ebi.ac.uk/srs6/man/srsman.html for more info).
They even list a set of data files in
 http://srs6.ebi.ac.uk/srs6bin/cgi-bin/wgetz?-page+databanks
Roger Sayle mentioned that I should also write something which
parses Icarus files to generate parse trees for this scanner
generator.  I think I'll skip that one for now.  (I shuddered
when I tried to read the Icarus files!)

						Andrew Dalke
						dalke@acm.org