[Biopython-dev] Martel-0.4 available

Wed Dec 6 04:53:23 EST 2000

Brad:
>One thing
>that I ended up doing was not using the AnyEOL test at all, and
>instead only using the \R syntax.

Admittedly, the existing AnyEol uses the old "\n" test so it
won't work on non-UNIX platforms.  Still, AnyEol() should
be just as good as using Re(r"\R").  (Partially because with Re
you really should be using a raw quoted string - a plain "\R"
works only because Python's normal strings currently don't do
anything with \R.)
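
To illustrate the raw-string point in plain Python (nothing here uses
Martel itself): a raw string always keeps the backslash, which matters as
soon as the escape is one Python recognizes.  The ANY_EOL pattern and
split_lines() are hypothetical names, and the pattern is only an assumption
about what \R is meant to cover (UNIX, Mac and DOS line endings), written
with the standard re module, which has no \R of its own.

```python
import re

# A raw string always keeps the backslash: two characters, "\" and "R".
raw = r"\R"
assert len(raw) == 2 and raw[0] == "\\"

# With an escape Python does recognize, raw and normal strings differ.
assert len(r"\n") == 2   # backslash + "n"
assert len("\n") == 1    # a single newline character

# Assumed meaning of \R: any one platform line ending.
# "\r\n" is listed first so a DOS ending is consumed as one unit.
ANY_EOL = re.compile(r"\r\n|\r|\n")

def split_lines(text):
    """Split text on any platform's line ending."""
    return ANY_EOL.split(text)
```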

>I also thought it would be nice if the RecordReader would accept \R as 
>a newline as well, so you could do something like
>RecordReader.EndsWith(handle, "//\R").

Some of the readers allow a trailing "\n".  This gets interpreted
to mean \R.  That changes the definition of "\n", which is probably
a bad idea.  I used "\n" because it's one character and not as
likely to be confused with other characters.  It shouldn't be
too hard to change to use \R instead.

> Even further along these
>lines, it would have been nice to be able to set the end with an
>arbitrary regular expression.

Indeed, that would be a final goal for Martel.  I can't do it.
If I could then your delimiter would be the Genbank record
definition itself and there would be no need for a RecordReader.

The problem is that I can't tell when mxTextTools reaches the
end of the string.  I would like it to ask "I've parsed this
data, got any more before I call it the end of input?".  All
I know now is that the parse failed, but it could be because
the text was in the wrong format or it needed more data to
finish the check.  I could keep on making the string larger
and larger, but when would I stop?
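
The ambiguity can be seen with any matcher that only reports success or
failure.  This is a minimal sketch using the standard re module rather than
mxTextTools, with a made-up record format; RECORD and parses() are
hypothetical names.  A truncated record and a malformed one fail in exactly
the same way, so the caller cannot tell whether reading more data would help.

```python
import re

# Hypothetical, simplified record format: an ID line followed by "//".
RECORD = re.compile(r"ID \w+\n//\n")

complete = "ID abc\n//\n"
truncated = "ID abc\n/"        # the rest has not been read yet
malformed = "XX abc\n//\n"     # wrong format; more data cannot help

def parses(text):
    """Return True if the whole text is exactly one record."""
    return RECORD.fullmatch(text) is not None

# Only the complete record parses; the other two are indistinguishable.
results = [parses(complete), parses(truncated), parses(malformed)]
```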

BTW, that "make the string larger and larger" is what I do
with the StartsWith and EndsWith.  That only works because I
know exactly the contents of the string so I know the failure
conditions, and because the record sizes are usually a lot
smaller than the lookahead buffer so I don't have the N**2
case of appending strings and retesting.
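
A rough sketch of that "grow the buffer and retest" idea follows.  It is
not Martel's actual RecordReader code: read_records(), its parameters and
the assumption of "\n" line endings are all made up for illustration.
Each time the terminator is not found, another chunk is appended and the
whole buffer is searched again, which is where the N**2 behaviour would
come from if records were much larger than the chunk size.

```python
import io

def read_records(handle, terminator="//", bufsize=8192):
    """Yield records, each ending with a line equal to `terminator`.

    Hypothetical sketch; assumes "\n" line endings.
    """
    marker = terminator + "\n"
    buf = ""
    while True:
        # Look for the terminator at the start of some line in the buffer.
        if buf.startswith(marker):
            end = len(marker)
        else:
            pos = buf.find("\n" + marker)
            end = pos + 1 + len(marker) if pos != -1 else -1
        if end != -1:
            yield buf[:end]
            buf = buf[end:]
            continue
        # No complete record yet: grow the buffer and test again.
        chunk = handle.read(bufsize)
        if not chunk:
            break
        buf += chunk
    if buf:
        yield buf  # leftover text with no terminator
```

Trailing blank lines after the last record come out as a final leftover
chunk, which the caller has to deal with separately.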

>I ran into problems with
>files like the biojava genbank test file, where there are a bunch of
>linefeeds at the end of the file, but this could be a problem with a
>file of cut'n'pasted records that had differing amounts of
>linebreaks.

If you use the HeaderFooter parser, you have an empty header
and a footer which matches "\R*".  See the PIR example which
allows a trailing \\\ .

When it reads past the final /// it will try to parse the
newlines as a record.  That will fail, so it passes the text
off to the footer parser.
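
The hand-off can be sketched as below.  This uses the standard re module
and a made-up record format, not Martel's HeaderFooter API; RECORD, FOOTER
and parse() are hypothetical, and FOOTER is an assumed stand-in for "\R*".
The header is empty, so parsing starts directly with the records.

```python
import re

# Hypothetical formats for illustration only.
RECORD = re.compile(r"ID \w+\n//\n")
FOOTER = re.compile(r"(?:\r\n|\r|\n)*")   # stands in for "\R*"

def parse(text):
    """Return (records, footer); raise ValueError on a bad footer."""
    records = []
    pos = 0
    while True:
        m = RECORD.match(text, pos)
        if m is None:
            break            # not a record: hand the rest to the footer
        records.append(m.group())
        pos = m.end()
    footer = text[pos:]
    if FOOTER.fullmatch(footer) is None:
        raise ValueError("unparsed text at offset %d" % pos)
    return records, footer
```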

Another nice thing about the Record Parsers - if there's an error
when processing a record, it's an 'error' but not a 'fatalError'.
It can recover by processing the next record.
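
The names 'error' and 'fatalError' match the methods of the SAX
ErrorHandler interface in the Python standard library.  Assuming that is
the interface in play, recovery could look roughly like this.  The record
format, CollectingErrorHandler and parse_records() are hypothetical and are
not Martel code.

```python
import re
from xml.sax.handler import ErrorHandler

class CollectingErrorHandler(ErrorHandler):
    """Record recoverable errors instead of raising them."""
    def __init__(self):
        self.errors = []

    def error(self, exception):
        self.errors.append(exception)   # recoverable: remember and go on

    def fatalError(self, exception):
        raise exception                 # unrecoverable: stop parsing

RECORD = re.compile(r"ID \w+\n//\n")    # hypothetical record format

def parse_records(records, handler):
    """Parse each record in turn; report failures via handler.error()."""
    good = []
    for i, rec in enumerate(records):
        if RECORD.fullmatch(rec):
            good.append(rec)
        else:
            handler.error(ValueError("record %d did not parse" % i))
    return good
```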

>I have a quick question about mxTextTools importing -- you are now
>importing with:
>
>from mx import TextTools
>
>When did it get a mx meta-directory? Is this a new version or anything 
>fancy? It was no big deal, I was just curious.

Oops, didn't realize I was doing that.  I'm using a prerelease
version of mxTextTools 1.2 which changes the organization.  I
really should use just TextTools.  (1.2 has backwards compatible
support for that.)
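
For code that has to run against either layout, one option is to try each
module name in turn.  import_first() is a hypothetical helper built on the
standard importlib module; the two module names are the ones mentioned
above, and the order (plain TextTools first) is only a guess at what is
most compatible.

```python
import importlib

def import_first(*names):
    """Return the first module in `names` that can be imported."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("none of %s could be imported" % ", ".join(names))

# Intended use (commented out so the sketch runs without mxTextTools):
# TextTools = import_first("TextTools", "mx.TextTools")
```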

>One thing that I didn't use is a Martel based iterator -- I just stuck 
>with the type of iterator that Jeff uses in other Biopython parsers 
>but used the RecordReader to implement it. I'm not sure if it could be 
>done in a better way with a Martel iterator...

Depends on the needs.  From what I saw of your adapter, it was
a pretty straight match between the two.

>BTW, the debug_level = 2 option on the parser is incredibly nice. It
>really helps get at why a parse is failing and makes it much easier to 
>correct the problem. I probably would still be pulling my hair out
>trying to regexp right without this. Thanks!

I agree.  I was working on the PIR parser and having the correct
byte position (debug_level = 1) was wonderful.  Then when I got
really confused, I upped it to 2 to get an idea of what it was
attempting to parse.

                    Andrew
                    dalke at acm.org