[Biopython-dev] Martel-0.4 available
Andrew Dalke
dalke at acm.org
Wed Dec 6 04:53:23 EST 2000
Brad:
>One thing
>that I ended up doing was not using the AnyEOL test at all, and
>instead only using the \R syntax.
Admittedly, the existing AnyEol uses the old "\n" test so it
won't work on non-UNIX platforms. Still, AnyEol() should
be just as good as using Re(r"\R"). (Partially because you
really should be using a raw quoted string - it works because
Python's normal strings currently don't do anything with \R.)
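The raw-string point can be checked directly in Python. This is a minimal sketch using only string literals; "\n" is used for the comparison because it is an escape Python does interpret, and the variable names are only illustrative.

```python
# A normal string literal turns "\n" into one newline character;
# a raw string literal keeps the backslash and the letter.
normal = "\n"
raw = r"\n"

print(len(normal))  # 1
print(len(raw))     # 2

# Writing the pattern as a raw string states the intent explicitly:
# the backslash is meant for the regular expression engine, not Python.
pattern = r"\R"
print(pattern == "\\R")  # True
```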
>I also thought it would be nice if the RecordReader would accept \R as
>a newline as well, so you could do something like
>RecordRecorder.EndsWith(handle, "//\R").
Some of the readers allow a trailing "\n". This gets interpreted
to mean \R. That changes the definition of "\n", which is probably
a bad idea. I used "\n" because it's one character and not as
likely to be confused with other characters. It shouldn't be
too hard to change to use \R instead.
> Even further along these
>lines, it would have been nice to be able to set the end with an
>arbitrary regular expression.
Indeed, that would be a final goal for Martel. I can't do it.
If I could then your delimiter would be the Genbank record
definition itself and there would be no need for a RecordReader.
The problem is that I can't tell when mxTextTools reaches the
end of the string. I would like it to ask "I've parsed this
data, got any more before I call it the end of input?". All
I know now is that the parse failed, but it could be because
the text was in the wrong format or it needed more data to
finish the check. I could keep on making the string larger
and larger, but when would I stop?
BTW, that "make the string larger and larger" is what I do
with the StartsWith and EndsWith. That only works because I
know exactly the contents of the string so I know the failure
conditions, and because the record sizes are usually a lot
smaller than the lookahead buffer so I don't have the N**2
case of appending strings and retesting.
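The "make the string larger and larger" loop can be sketched in plain Python. This is not Martel's RecordReader code: `read_records`, `EOL` and the chunk size are hypothetical names and values, and \R is assumed to mean "\r\n", "\n" or "\r". It also shows the "needs more data" problem in miniature: a match that touches the end of the buffer is not trusted until more input arrives.

```python
import io
import re

# Assumption: "\R" stands for any of "\r\n", "\n" or "\r".
EOL = r"(?:\r\n|\n|\r)"


def read_records(handle, terminator="//", chunk_size=8192):
    """Yield chunks of text that end with a `terminator` line."""
    # The terminator must start a line and be followed by a line ending.
    end_of_record = re.compile(
        r"(?:^|(?<=[\r\n]))" + re.escape(terminator) + EOL
    )
    buffer = ""
    at_eof = False
    while True:
        match = end_of_record.search(buffer)
        # A match reaching the end of the buffer may be incomplete
        # ("\r" now, "\n" in the next chunk), so read more first.
        if match and (match.end() < len(buffer) or at_eof):
            yield buffer[:match.end()]
            buffer = buffer[match.end():]
            continue
        if at_eof:
            break
        chunk = handle.read(chunk_size)
        if chunk:
            # No usable terminator yet: grow the string and retest.
            # The retest rescans the buffer, which stays cheap only
            # while records are small compared with the chunk size.
            buffer += chunk
        else:
            at_eof = True
    if buffer:
        # Leftover text that never reached a terminator.
        yield buffer


handle = io.StringIO("ID a\n//\nID b\r\n//\r\n")
records = list(read_records(handle, chunk_size=4))
print(len(records))
```

This works only because the terminator is a fixed, known string. With an arbitrary regular expression there is no general way to tell "failed" apart from "needs more input".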
>I ran into problems with
>files like the biojava genbank test file, where there are a bunch of
>linefeeds at the end of the file, but this could be a problem with a
>file of cut'n'pasted records that had differing amounts of
>linebreaks.
If you use the HeaderFooter parser, you have an empty header
and a footer which matches "\R*". See the PIR example which
allows a trailing \\\ .
When it reads past the final /// it will try to parse the
newlines as a record. That will fail, so it passes the text
off to the footer parser.
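The hand-off from the record parser to the footer can be sketched with the standard `re` module. This does not use Martel's HeaderFooter API: `RECORD`, `FOOTER` and `parse` are hypothetical, the header is taken to be empty, a record is assumed to be one or more non-blank lines followed by a "//" line, and \R is assumed to mean "\r\n", "\n" or "\r".

```python
import re

# Assumption: "\R" stands for any of "\r\n", "\n" or "\r".
EOL = r"(?:\r\n|\n|\r)"

# One or more non-blank lines that do not start with "//",
# followed by the "//" terminator line.
RECORD = re.compile(r"(?:(?!//)[^\r\n]+" + EOL + r")+//" + EOL)

# The footer is the equivalent of "\R*": zero or more line endings.
FOOTER = re.compile(EOL + r"*")


def parse(text):
    """Return (records, footer) for `text`, which has an empty header."""
    records = []
    pos = 0
    while True:
        match = RECORD.match(text, pos)
        if match is None:
            # Not a record: whatever remains goes to the footer.
            break
        records.append(match.group())
        pos = match.end()
    footer = FOOTER.fullmatch(text, pos)
    if footer is None:
        raise ValueError("unparsed text at offset %d" % pos)
    return records, footer.group()


records, footer = parse("ID a\nSQ x\n//\nID b\n//\n\n\n\n")
print(len(records), repr(footer))
```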
Another nice thing about the Record Parsers - if there's an error
when processing a record, it's an 'error' but not a 'fatalError'.
It can recover by processing the next record.
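A rough sketch of that recovery, assuming 'error' and 'fatalError' refer to the SAX ErrorHandler callbacks, which Python exposes as `xml.sax.handler.ErrorHandler`. The handler subclass, `parse_records` and `parse_one` are hypothetical stand-ins, not Martel code.

```python
from xml.sax.handler import ErrorHandler


class CollectingErrorHandler(ErrorHandler):
    """Remember recoverable errors; stop on fatal ones."""

    def __init__(self):
        self.errors = []

    def error(self, exception):
        # Recoverable: note it and let the caller move on.
        self.errors.append(exception)

    def fatalError(self, exception):
        # Not recoverable: abort the whole parse.
        raise exception


def parse_one(record):
    # Hypothetical per-record parser: return the ID of the record.
    if not record.startswith("ID "):
        raise ValueError("record does not start with 'ID '")
    return record.split()[1]


def parse_records(records, handler):
    results = []
    for record in records:
        try:
            results.append(parse_one(record))
        except ValueError as exc:
            # One bad record is reported as an 'error'; the loop
            # then continues with the next record.
            handler.error(exc)
    return results


handler = CollectingErrorHandler()
ids = parse_records(["ID a\n//\n", "garbage\n//\n", "ID b\n//\n"], handler)
print(ids, len(handler.errors))
```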
>I have a quick question about mxTextTools importing -- you are now
>importing with:
>
>from mx import TextTools
>
>When did it get a mx meta-directory? Is this a new version or anything
>fancy? It was no big deal, I was just curious.
Oops, didn't realize I was doing that. I'm using a prerelease
version of mxTextTools 1.2 which changes the organization. I
really should use just TextTools. (1.2 has backwards compatible
support for that.)
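Code that has to run against either layout could guard the import. This is only a sketch: the final fallback to None is an assumption added so the snippet runs even when mxTextTools is not installed at all.

```python
try:
    # Package layout of the mxTextTools 1.2 prerelease.
    from mx import TextTools
except ImportError:
    try:
        # Older top-level module layout.
        import TextTools
    except ImportError:
        # Neither layout is installed; callers must check for None.
        TextTools = None

print(TextTools is not None)
```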
>One thing that I didn't use is a Martel based iterator -- I just stuck
>with the type of iterator that Jeff uses in other Biopython parsers
>but used the RecordReader to implement it. I'm not sure if it could be
>done in a better way with a Martel iterator...
Depends on the needs. From what I saw of your adapter, it was
a pretty straight match between the two.
>BTW, the debug_level = 2 option on the parser is incredibly nice. It
>really helps get at why a parse is failing and makes it much easier to
>correct the problem. I probably would still be pulling my hair out
>trying to regexp right without this. Thanks!
I agree. I was working on the PIR parser and having the correct
byte position (debug_level = 1) was wonderful. Then when I got
really confused, I upped it to 2 to get an idea of what it was
attempting to parse.
Andrew
dalke at acm.org