[Biopython-dev] Martel

Sun Sep 17 14:39:34 EDT 2000

Cayte <katel at worldpath.net>
>  My impression of Martel is that it will require extensive testing,
> because it has so many paths.  The tests cover the basic expressions,
> but I'd be surprised if there are no weird interactions.  The code may
> lose its context, on complicated paths.  I could help with adding unit
> tests.

One of the things I found during development was that it was almost
impossible to write a parser without testing each of the components against
real text.  What you are seeing is the support framework needed for that.

Concerning the number of paths, I'm not sure which paths you're talking
about.  There are two I can think of.  One is the generation of the
state table for mxTextTools and the other is the evaluation of the text
through that state table.  The first is fairly straightforward,
very much like unoptimized code generation from a parse tree.  It does
need documentation so others can verify my work.  The second is indeed
more complicated, but it should be about as complicated as hand-written
parser code of equivalent ability.
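
As a rough illustration of those two paths, here is a toy sketch in
Python.  It is not Martel's code and does not use the real mxTextTools
table format; the node kinds ("lit", "any_of", "seq") and the command
names ("literal", "charset_run") are made up for the example.  The first
function walks a small parse tree and emits a flat table, with no
optimization.  The second runs text through that table.

```python
def generate(node):
    """Walk a parse tree and emit a flat list of (command, argument) entries."""
    kind = node[0]
    if kind == "lit":            # match an exact string
        return [("literal", node[1])]
    if kind == "any_of":         # match one or more characters from a set
        return [("charset_run", node[1])]
    if kind == "seq":            # children are emitted one after another
        table = []
        for child in node[1]:
            table.extend(generate(child))
        return table
    raise ValueError("unknown node kind: %r" % (kind,))


def evaluate(table, text):
    """Run text through the table; return the end position, or None on failure."""
    pos = 0
    for command, arg in table:
        if command == "literal":
            if not text.startswith(arg, pos):
                return None
            pos += len(arg)
        elif command == "charset_run":
            end = pos
            while end < len(text) and text[end] in arg:
                end += 1
            if end == pos:       # at least one character is required
                return None
            pos = end
        else:
            raise ValueError("unknown command: %r" % (command,))
    return pos


tree = ("seq", [("lit", "ID="), ("any_of", "0123456789")])
table = generate(tree)
print(table)
print(evaluate(table, "ID=123;"))
```

Even in this toy version the generator is a plain recursive walk, while
the evaluator is where the per-command special cases pile up.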

Debugging, btw, is also somewhat complicated because a failure is reported
at the last character where something matched successfully, rather than
at the last character that was actually tested.  I need to take a look at
the mxTextTools code to see if there's a way to give better position
information.
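
To make the difference between those two positions concrete, here is a
minimal sketch in Python.  It is a hypothetical toy matcher, not
mxTextTools and not Martel; the function name and the sample text are
invented for the example.  It matches a list of literal strings in order
and tracks both the end of the last literal that matched completely and
the furthest index at which a character was actually compared.

```python
def match_literals(text, literals):
    """Match each literal in order.

    Returns (success, last_ok, furthest_tested), where last_ok is the end
    of the last literal that matched completely and furthest_tested is the
    highest index at which a character comparison was made.
    """
    pos = 0        # end of the last fully matched literal
    furthest = 0   # highest index compared so far
    for lit in literals:
        for offset, expected in enumerate(lit):
            idx = pos + offset
            if idx >= len(text):
                return False, pos, furthest
            furthest = max(furthest, idx)
            if text[idx] != expected:
                return False, pos, furthest
        pos += len(lit)
    return True, pos, furthest


# The second literal fails on the "E" of "HUMEN".
ok, last_ok, furthest = match_literals("ID=ABC_HUMEN", ["ID=", "ABC_HUMAN"])
print(ok, last_ok, furthest)
```

Here the "last success" position is 3, just after "ID=", while the
character that actually caused the failure is at index 10.  The second
number is the one that helps when hunting for the bad character in the
input.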

>  In a few cases, I think the names need to be more descriptive.  Variables
> like p, s or av don't give a lot of information.  Also, the name "pattern"
> is used for too many things, that have different meanings.

You're missing a few other naming clashes in my code.  I agree, it needs
a full cleanup before it is of good enough quality that I would foist it
off on most people.  The names are confusing because I was confused myself
when writing the code.  I was working with a couple of toolsets (sre_parse
and mxTextTools) which I hadn't used before, and I was changing my idea of
how things should be done based on what I learned using them.  (Not an
excuse, just history, and they do need to get fixed.)

There are two major reasons why I haven't fixed things.  One is, alas, the
lack of time.  The other is that there are a few changes I need to make to
support certain formats and needs.  I've added a "named group repeat" where
a named group can be used as the repeat count for later groups.  (This is
needed for MDL's CT format, which gives the atom and bond counts then
"atom_count" lines of atom records and "bond_count" lines of bond records.)
I also need to redo how it handles files so I can feed it a record at
a time rather than the whole data file, but without changing the SAX events.
(Jeff first suggested this one.)
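
The idea behind the "named group repeat" can be sketched in plain Python
with the standard re module.  This is not Martel syntax, and the input
layout is a simplified, hypothetical stand-in for a CT-style block (one
counts line, then the records), not the real MDL format.  The group
names atom_count and bond_count come from the description above;
everything else is assumed for the example.

```python
import re

# Counts line: two integers, captured as named groups.
COUNTS = re.compile(r"\s*(?P<atom_count>\d+)\s+(?P<bond_count>\d+)")


def parse_counted_records(lines):
    """Use the counts captured on the first line to split the rest."""
    m = COUNTS.match(lines[0])
    if m is None:
        raise ValueError("first line does not hold two counts")
    n_atoms = int(m.group("atom_count"))
    n_bonds = int(m.group("bond_count"))
    body = lines[1:]
    if len(body) < n_atoms + n_bonds:
        raise ValueError("fewer record lines than the counts promise")
    # The captured values act as repeat counts for the following records.
    atoms = body[:n_atoms]
    bonds = body[n_atoms:n_atoms + n_bonds]
    return atoms, bonds


sample = [
    "3 2",
    "C 0.0 0.0 0.0",
    "O 1.2 0.0 0.0",
    "H 0.0 1.0 0.0",
    "1 2 2",
    "1 3 1",
]
atoms, bonds = parse_counted_records(sample)
print(len(atoms), len(bonds))
```

A plain regular expression cannot express "repeat this group N times,
where N was matched earlier", which is why the count has to be pulled out
and applied in a separate step here.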

So I'm still in the experimental phase to see what other changes are needed,
and I'm hoping to get feedback from others about it.  Thus, I haven't
wanted to go through the code cleaning it up until I know more about
what to change.

>  Finding self-documenting names can be hard, but sometimes the effort
> to find the right metaphor clarifies your thinking.

Yep, and yep.

                    Andrew
                    dalke at acm.org