[Biopython-dev] Eol
Andrew Dalke
dalke at acm.org
Wed Dec 6 23:24:58 EST 2000
> Should the last line of text have an implicit Eol? This test
> assumes it should, but the test failed. A test that's identical,
> except that the target text ends with a newline, passed.
The expression:
> exp1 = Martel.ToEol()
> exp2 = Martel.ToEol()
> exp3 = Martel.ToEol()
> expression = exp1 + exp2 + exp3
requires a final newline. It's possible to write an expression
which doesn't need that, as with
  exp3 = Martel.Re(r"[^\R]*\R?")
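To make the difference concrete, here is a rough sketch using Python's standard re module rather than Martel itself. It assumes \R stands for any platform's newline; Python's re has no \R, so the EOL string below is a stand-in, and the names strict and lenient are only illustrative.

```python
import re

# Stand-in for \R (assumed to mean any platform newline); not Martel syntax.
EOL = r"(?:\r\n|\r|\n)"

# Rough analogue of ToEol(): rest of the line, then a mandatory newline.
to_eol = r"[^\r\n]*" + EOL
# Rough analogue of Re(r"[^\R]*\R?"): the trailing newline is optional.
to_eol_optional = r"[^\r\n]*" + EOL + "?"

strict = re.compile(to_eol * 3)                      # exp1 + exp2 + exp3
lenient = re.compile(to_eol * 2 + to_eol_optional)   # last line relaxed

with_newline = "one\ntwo\nthree\n"
without_newline = "one\ntwo\nthree"

print(strict.fullmatch(with_newline) is not None)      # True
print(strict.fullmatch(without_newline) is not None)   # False
print(lenient.fullmatch(with_newline) is not None)     # True
print(lenient.fullmatch(without_newline) is not None)  # True
```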
As written, it is hard in Martel to make the ToEol expression
automatically recognize that a final newline is not needed. It
could be written as
  [^\R]*(\R|$)
assuming that $ was changed to mean "end of text" rather than
end of line as I believe it does now. (I mentioned yesterday
that I don't like the ^ and $ assertions.)
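For comparison only, Python's re module keeps the two meanings apart: with re.MULTILINE, "$" asserts end of line, while "\Z" asserts end of text. The sketch below uses Python's re, not Martel, with the same newline stand-in as an assumption, to show what an "end of text" reading of the pattern would accept.

```python
import re

text = "one\ntwo\nthree"  # no final newline

# With MULTILINE, "$" is an end-of-line assertion: it matches before any "\n".
dollar_mid = re.search(r"two$", text, re.MULTILINE) is not None
# "\Z" is an end-of-text assertion: it matches only at the very end.
z_mid = re.search(r"two\Z", text) is not None
z_end = re.search(r"three\Z", text) is not None
print(dollar_mid, z_mid, z_end)  # True False True

# Analogue of [^\R]*(\R|$) with "$" read as end of text.
line = r"[^\r\n]*(?:\r\n|\r|\n|\Z)"
three_lines = re.compile(line * 3)
print(three_lines.fullmatch(text) is not None)         # True
print(three_lines.fullmatch(text + "\n") is not None)  # True
```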
Instead, it is easier (not necessarily better!) if the format
author defines the last line to have an optional \R.
Still, complications arise from interactions with the record
readers. They read a record at a time and pass the string
over to the parser. The '$' will match at the end of that
string even though in the full format (non-record reader based)
it would not have matched.
After a bit of thought I realize that's a knee-jerk reaction.
That isn't a big concern since there are similar problems
already. For example, if the record parser uses "(.|\n)*" it
will read up to the end of the record, but in the full format
would read the whole file.
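Both effects can be seen in a small sketch. The read_records function and the "//" record terminator are hypothetical stand-ins for a record reader, not Martel's actual implementation, and the patterns again use Python's re.

```python
import re

def read_records(text):
    """Hypothetical record reader: a record ends with a '//' line."""
    record = []
    for line in text.splitlines(keepends=True):
        record.append(line)
        if line.rstrip("\r\n") == "//":
            yield "".join(record)
            record = []
    if record:  # trailing text with no terminator
        yield "".join(record)

full = "ID 1\n//\nID 2\n//"  # note: no final newline
records = list(read_records(full))

# An end-of-text assertion succeeds on every record string...
end_anchored = re.compile(r"ID \d+\n//\n?\Z")
print([end_anchored.match(r) is not None for r in records])  # [True, True]
# ...but fails at the start of the full text, where more input follows.
print(end_anchored.match(full) is not None)                  # False

# "(.|\n)*" stops at the end of whatever string it is handed.
greedy = re.compile(r"(.|\n)*")
print(greedy.match(records[0]).group(0) == records[0])  # True
print(greedy.match(full).group(0) == full)              # True
```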
Another solution is to have a specialized ToEol (either a
new function or an optional argument) which generates the
"\R?" form.
Finally, I don't think this is much of an issue for real
formats. All the ones I've tested so far have a final newline,
although I don't expect that to always be the case. In
addition, the last line is usually well defined so a ToEol
(special or otherwise) isn't needed. E.g., it can be defined
with Re(r"///\R?") or Re(r"END\R?").
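As an illustration of that kind of definition, the sketch below (Python's re again, with a made-up two-line record body) requires a newline on every body line but leaves it optional after the "///" terminator.

```python
import re

EOL = r"(?:\r\n|\r|\n)"            # stand-in for \R, an assumption
body_line = r"[^\r\n]*" + EOL      # ordinary lines must end with a newline
terminator = "///" + EOL + "?"     # analogue of Re(r"///\R?")

record = re.compile("(?:" + body_line + ")*?" + terminator)

print(record.fullmatch("ID 1\nSQ acgt\n///\n") is not None)  # True
print(record.fullmatch("ID 1\nSQ acgt\n///") is not None)    # True
print(record.fullmatch("ID 1\nSQ acgt") is not None)         # False
```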
I'll point out that the record readers are designed so that
a final newline is not needed for the record. Thus, any
problems with a missing newline should be completely handleable
by an appropriate format definition.
Andrew
dalke at acm.org