[Biopython-dev] Martel stuff

Andrew Dalke dalke at acm.org
Sat Sep 30 13:50:44 EDT 2000


Brad:
> I think this is right, but when I do this it makes the parse hang and
> never finish. Hmmm.... I'm not sure how to debug this, any ideas?

The code looks correct, except you should use "Opt(expr)" as a shorthand
for "MaxRepeat(expr, 0, 1)".

The hang you are seeing is likely a problem with Martel.  Suppose it needs
to match 0 or more times, and one of the matches can be of size 0.  Then
it will sit on that spot forever, continuously eating groups of size 0.
The best way to work around the problem is to make sure that all repeat
groups are guaranteed to consume at least one character.  Another
workaround is to put an upper limit on the repeat count.
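
To make the failure mode concrete, here is a rough sketch using Python's
re module rather than Martel or mxTextTools.  The function name
repeat_match and its max_count parameter are made up for illustration;
this is not how Martel builds its tag tables.  It only shows why a
zero-width match stalls a repeat loop, and the two guards described
above.

```python
import re

def repeat_match(pattern, text, pos=0, max_count=None):
    """Apply `pattern` repeatedly from `pos`; return (count, end_pos).

    Hypothetical helper: stops when the pattern fails, when a match
    consumes no characters, or after `max_count` repetitions.
    """
    count = 0
    while max_count is None or count < max_count:
        m = pattern.match(text, pos)
        if m is None:
            break
        if m.end() == pos:
            # Zero-width match.  Without this check `pos` never moves
            # and the loop would spin on the same spot forever.
            break
        pos = m.end()
        count += 1
    return count, pos

digits_optional = re.compile(r"\d*")  # can match the empty string
digits_required = re.compile(r"\d+")  # always consumes a character

print(repeat_match(digits_optional, "123abc"))  # (1, 3)
print(repeat_match(digits_required, "123abc"))  # (1, 3)
```

Removing the m.end() == pos check and calling the function with
digits_optional and no max_count reproduces the hang.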

Once I get this next release out, I'll see about generating tag tables
which check the size of any match.  There will be quite a bit of overhead
for doing that, so I'm thinking of having a debug version which would
handle this and be better able to pinpoint error positions.

> it looks like we'll need the PyXML package :-< Python2.0 doesn't seem to 
> come with saxlib, which we need to implement handler classes for the XML
> produced by Martel.

What about xml.sax.handler?

I haven't sat down with the new Python distro to see what's changed.
Again, that will wait until after I get this 0.3 release out.
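
For reference, a minimal sketch of a handler built only on the standard
library's xml.sax.handler, written for current Python.  The element
names ("record", "name", "sequence") and the class NameCollector are
invented for this example; they are not the tags Martel emits.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class NameCollector(ContentHandler):
    """Collect the text of every <name> element (hypothetical tag)."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._buffer = None  # list of text chunks while inside <name>

    def startElement(self, name, attrs):
        if name == "name":
            self._buffer = []

    def characters(self, content):
        # characters() may be called several times per element,
        # so accumulate chunks instead of assuming one call.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "name" and self._buffer is not None:
            self.names.append("".join(self._buffer))
            self._buffer = None

sample = b"<record><name>seq1</name><sequence>ACGT</sequence></record>"
handler = NameCollector()
xml.sax.parseString(sample, handler)
print(handler.names)
```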

> 3. What are people's thoughts about integrating Martel more tightly with
> Biopython?

Jeff says that he's for it.  I just need to (again) get this release
out so people can start testing it.

>  Do you think it would be worthwhile for me to try my hand at
> implementing a Martel based Fasta parser that would work with the code
> Jeff has already got in place? 

Yes, and no.  The biggest change for 0.3 is support for hybrid parsers,
which use a simple reader to grab a record at a time, then pass that
to Martel for in-depth parsing.  This reduces the amount of memory needed
to parse a file.
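
The record-at-a-time idea can be sketched roughly as below.  This is an
illustration only, not the 0.3 reader: read_fasta_records is a
hypothetical name, and it assumes that every record starts with a line
beginning with ">".  Each yielded string would then be handed to the
in-depth parser, so only one record is held in memory at a time.

```python
import io

def read_fasta_records(handle):
    """Yield one FASTA record (header plus sequence lines) at a time."""
    lines = []
    for line in handle:
        if not line.strip():
            continue  # ignore blank lines between records
        if line.startswith(">") and lines:
            # A new header closes the record collected so far.
            yield "".join(lines)
            lines = []
        lines.append(line)
    if lines:
        yield "".join(lines)  # last record in the file

data = ">a desc\nACGT\nTTGA\n\n>b\nGGCC\n"
for record in read_fasta_records(io.StringIO(data)):
    # In a hybrid parser, this is where the record text would be
    # passed on for detailed parsing.
    print(repr(record))
```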

So the "yes" part means, go ahead and write a parser for FASTA which
produces Biopython data structures.  However, it will likely change in
the future.  In fact, for FASTA I would probably have the regexp available,
so it can be merged with other expressions, but have it create a scanner
which is pure Python generating the SAX events, rather than going through
mxTextTools.
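
A pure Python scanner of that kind might look something like the sketch
below.  Everything here is an assumption made for illustration: the
function name scan_fasta, the EventLog handler, and the element names
("fasta", "record", "description", "sequence") are not taken from
Martel.  The scanner calls the SAX ContentHandler methods directly
instead of going through a tag table.

```python
import io
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesImpl

def scan_fasta(handle, handler):
    """Walk a FASTA file line by line and emit SAX events to `handler`."""
    no_attrs = AttributesImpl({})
    in_record = False
    handler.startDocument()
    handler.startElement("fasta", no_attrs)
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if in_record:
                handler.endElement("record")
            handler.startElement("record", no_attrs)
            in_record = True
            handler.startElement("description", no_attrs)
            handler.characters(line[1:])
            handler.endElement("description")
        elif line and in_record:
            handler.startElement("sequence", no_attrs)
            handler.characters(line)
            handler.endElement("sequence")
    if in_record:
        handler.endElement("record")
    handler.endElement("fasta")
    handler.endDocument()

class EventLog(ContentHandler):
    """Record events as tuples so the output can be inspected."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end", name))

log = EventLog()
scan_fasta(io.StringIO(">seq1 test\nACGT\n"), log)
for event in log.events:
    print(event)
```

Any handler written against the SAX interface can be plugged in where
EventLog is used here.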

                    Andrew




