[Bioperl-l] Porting Entrez Gene parser to Biojava, Biopython, Biophp, even C++

Sun Mar 13 14:07:55 EST 2005

Mingyi Liu wrote:
> I forgot to mention another advantage of having a purely regex based 
> small parser means very easy porting into any language that supports 
> perl styled regular expressions, like Java, Python, PHP, C++ with PCRE 
> (used by php and python).

I developed Martel ( http://www.dalkescientific.com/Martel/ ) to
do just this sort of thing - describe a typical bioinformatics
file format as a set of declarations instead of as a set of code.

It works, but it turns out to be hard to maintain.  Here's a
list of problems I came up with:

   - regexps are hard to write and debug
       Could be improved with some sort of development/
       testing environment

   - Martel's grammars are hard to edit
       When a grammar changes it's not possible to say "the
       new format is the old format but change this one
       bottom level node".  I'm actually considering
       switching over to a DOM-style description of the
       tree so I can use XSLT as the editing language.
       Except that I think XSLT's grammar is clumsy and ugly.

   - Martel needs everything in memory
       I implemented a hack to parse a record at a time but
       it's a hack and fails (except on large memory machines)
       for people who want to read a chromosome at a time.
       I would also like it to be feed based instead of
       pull based.
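
To make the feed/pull distinction concrete, here is a minimal
sketch of a feed-based interface.  It is not Martel code.  The
class name, the callback, and the "//" terminator line (as in
SWISS-PROT flat files) are assumptions made for illustration,
and the terminator search is deliberately naive.

```python
class RecordFeeder:
    """Hypothetical feed-based splitter: the caller pushes chunks in."""

    def __init__(self, on_record, terminator=b"//\n"):
        self.on_record = on_record      # called once per complete record
        self.terminator = terminator    # assumed end-of-record marker
        self.buffer = b""

    def feed(self, chunk):
        """Accept the next chunk of bytes and emit any finished records."""
        self.buffer += chunk
        while True:
            end = self.buffer.find(self.terminator)
            if end == -1:
                break                   # record incomplete; wait for more data
            end += len(self.terminator)
            self.on_record(self.buffer[:end])
            self.buffer = self.buffer[end:]

    def close(self):
        """Signal end of input; leftover data means a truncated record."""
        if self.buffer.strip():
            raise ValueError("input ended inside a record")
```

In a pull-based design the parser calls read() on a file object.
Here the caller decides when data arrives, so the same code can
sit behind a socket or a decompressor.  Only the unfinished
record is buffered.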

I found that normal regular expressions weren't quite powerful
enough to handle the format, so I needed to implement a new
feature for some file formats which include a count N, the
number of records, followed by N repeats of those records.
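
A single regular expression cannot say "read N, then match
exactly N records", but a small amount of code around two
regexps can.  The sketch below shows the idea in Python.  The
"COUNT"/"ITEM" line format is invented for this example and is
not one of the real database formats, and this is not how Martel
implements the feature.

```python
import re

# Hypothetical format: a "COUNT <N>" line followed by exactly N "ITEM ..." lines.
COUNT_LINE = re.compile(r"COUNT (\d+)\n")
ITEM_LINE = re.compile(r"ITEM ([^\n]*)\n")

def parse_counted_block(text, pos=0):
    """Return (items, end_pos); raise ValueError if the block is malformed."""
    m = COUNT_LINE.match(text, pos)
    if m is None:
        raise ValueError("expected COUNT line at offset %d" % pos)
    n = int(m.group(1))
    pos = m.end()
    items = []
    for i in range(n):
        # The repeat count comes from the data, not from the pattern.
        m = ITEM_LINE.match(text, pos)
        if m is None:
            raise ValueError("expected item %d of %d at offset %d" % (i + 1, n, pos))
        items.append(m.group(1))
        pos = m.end()
    return items, pos
```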

When I wrote my grammars I did so in strict mode, and reported
a bunch of errors to the database providers.  The advantage
is that wrong formats aren't accidentally parsed.  The disadvantage
is that minor changes break the parser.

I don't see any solution to this other than having someone
track the file formats over time.

> There could potentially be performance hit to any perl parsers
> ported into those languages.  Mainly because AFAIK there is a
> lack of full support for all the modifiers for Perl regex, so
> unless I missed something, we'd have to either code some modifier
> logic in the program or use string replacement.

I looked at the regexps.  The features that Python doesn't
support are \G and the match modifiers /cg .  They won't
be in Python because the start/end positions are available
as local variables and not as implicit globals.  Python
uses a different idiom.
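
As a rough illustration of that idiom, a Perl-style /\G.../gc
scanning loop can be written in Python by passing the current
position to match() explicitly.  The token pattern below is a
made-up stand-in, not the one used in EntrezGene.pm.

```python
import re

# Hypothetical token types, for illustration only.
TOKEN = re.compile(
    r"\s*(?:(?P<word>[A-Za-z_][\w-]*)|(?P<number>\d+)|(?P<punct>[{},]))"
)

def tokenize(text):
    """Scan text left to right, keeping the position in a local variable."""
    pos = 0
    tokens = []
    while pos < len(text):
        # match(text, pos) is anchored at pos, which plays the role of \G.
        m = TOKEN.match(text, pos)
        if m is None:
            if text[pos:].strip() == "":
                break                   # only trailing whitespace is left
            raise ValueError("unexpected input at offset %d" % pos)
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()                   # explicit, instead of Perl's pos()
    return tokens
```

A failed match leaves pos untouched, which is the behaviour the
/c modifier asks for in Perl.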

Years ago I did some timing tests for parsing SWISS-PROT
records using a large number of parsers (~20).  I found
a wide range of timings, from 1 minute to 40 minutes.
The diversity is because there are many different types
of things that might be done with a file.  If the task
is simple ("how many records are in this file?") then a
simple parser is all that's needed.

http://biopython.org/pipermail/biopython/2001-January/000472.html
http://biopython.org/pipermail/biopython-dev/2001-January/000257.html

The first of these lists some tasks that can't be done
with your approach, like being able to index all the
records in a file by byte position.
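
For reference, indexing by byte position can be sketched as
follows.  This assumes SWISS-PROT style records whose first line
starts with "ID" followed by the entry name; the function name
and the sample data are invented for the example.

```python
import io

def index_records(handle):
    """Map each record's ID to the byte offset where the record starts.

    handle must yield bytes (a file opened in binary mode) so that the
    offsets are true byte positions usable with seek().
    """
    index = {}
    offset = 0
    for line in handle:
        if line.startswith(b"ID "):
            index[line.split()[1].decode("ascii")] = offset
        offset += len(line)
    return index

# Usage with an in-memory stand-in for open(path, "rb").
sample = io.BytesIO(b"ID   A_HUMAN\nSQ   x\n//\nID   B_HUMAN\nSQ   y\n//\n")
offsets = index_records(sample)
sample.seek(offsets["B_HUMAN"])
first_line = sample.readline()   # the ID line of the second record
```

The parser has to know where each record starts in the file,
which is information a tokenizer working on an in-memory string
does not keep.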

Parsers can also get better performance by assuming the
file format is correct.  E.g., your EntrezGene.pm doesn't
detect if the file was truncated (I fed it only the first
1000 lines of the human genome file) while the context-free
parsers you have will at least generate an error that
the parentheses are unbalanced.
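
A cheap truncation check along those lines might look like the
sketch below.  It assumes the nesting delimiters are braces, as
in ASN.1 text output, and that delimiters inside double-quoted
strings should be ignored.  Both are assumptions of this
example, and the function name is hypothetical.

```python
def check_balanced(text, open_ch="{", close_ch="}"):
    """Raise ValueError if delimiters are unbalanced or a string is left open."""
    depth = 0
    in_string = False
    for ch in text:
        if ch == '"':
            # A doubled "" inside a string toggles twice, so the state is kept.
            in_string = not in_string
        elif not in_string:
            if ch == open_ch:
                depth += 1
            elif ch == close_ch:
                depth -= 1
                if depth < 0:
                    raise ValueError("closing delimiter without an opener")
    if in_string or depth != 0:
        raise ValueError("input looks truncated")
```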

One thing I note, investigating a question of Hilmar's,
is that your tokenization of strings isn't quite complete.
Inside a double-quoted "string", a double quote is
escaped ""by doubling"" it.  Your tokenizer
doesn't convert the doubled quotes back into single ones.  My
Martel code has the same problem.  It needed another
layer to describe how to unescape strings and handle
word spilling.
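
The unescaping step itself is small.  Here is a sketch in
Python, assuming the only escape is the doubled double quote.
It does nothing about words spilled across lines, and the
function name is made up.

```python
import re

# A double-quoted string in which "" stands for one literal double quote.
QUOTED = re.compile(r'"((?:[^"]|"")*)"')

def unquote(token):
    """Strip the outer quotes and turn each "" back into a single "."""
    m = QUOTED.fullmatch(token)
    if m is None:
        raise ValueError("not a well-formed quoted string: %r" % (token,))
    return m.group(1).replace('""', '"')
```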

> Just some more cents (and advocation) :)

This email too is advocation.  I like the idea of having
one set of format definitions that can be shared
across the different code bases.  It's proved rather
difficult and tedious to implement.  I hope that
some of my experience will help you or the next
person working on the problem.

					Andrew Dalke
					dalke at dalkescientific.com