[Biopython-dev] Eol

Andrew Dalke dalke at acm.org
Wed Dec 6 23:24:58 EST 2000


>  Should the last line of text have an implicit Eol?  This test
>assumes it should, but the test failed.  A test that's identical,
>except that the target text ends with a newline, passed.

The expression:
>        exp1 = Martel.ToEol()
>        exp2 = Martel.ToEol()
>        exp3 = Martel.ToEol()
>        expression = exp1 + exp2 + exp3

requires a final newline.  It's possible to write an expression
which doesn't need that, as with

  exp3 = Martel.Re(r"[^\R]*\R?")
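
To make the difference concrete, here is a rough sketch using Python's
own re module rather than Martel.  The NL pattern is only my stand-in
for Martel's \R (Python's re has no \R), and the names LINE, strict,
relaxed and parses are made up for this illustration.

```python
import re

# Stand-in for Martel's \R (any newline); Python's re module has no \R.
NL = r"(?:\r\n|\r|\n)"
# Roughly one ToEol(): text up to and including a newline.
LINE = rf"[^\r\n]*{NL}"

# Three lines, each of which must end with a newline.
strict = re.compile(LINE * 3)

# Same, but the newline after the third line is optional.
relaxed = re.compile(LINE * 2 + rf"[^\r\n]*{NL}?")

def parses(pattern, text):
    """Return True if the pattern consumes the whole text."""
    return pattern.fullmatch(text) is not None

for text in ("a\nb\nc\n", "a\nb\nc"):
    print(repr(text), parses(strict, text), parses(relaxed, text))
```

Only the strict form rejects the text that lacks the final newline.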

As written, it is hard in Martel to make the ToEol expression
automatically recognize that a final newline is not needed.  It
could be written as
    [^\R]*(\R|$)
assuming that $ were changed to mean "end of text" rather than
"end of line", which is what I believe it means now.  (I mentioned
yesterday that I don't like the ^ and $ assertions.)
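
For comparison only, this is how the two meanings look in Python's re
module, which may not behave the same way as Martel.  There a multiline
"$" matches at the end of every line, while \Z matches only at the end
of the text, so \Z plays the role of the "end of text" assertion here.
The variable names are made up for the example.

```python
import re

text = "first\nsecond"   # no final newline

# In multiline mode "$" matches before each newline and at the end of the text.
dollar_positions = [m.start() for m in re.finditer(r"$", text, re.MULTILINE)]

# \Z matches only at the very end of the text.
end_positions = [m.start() for m in re.finditer(r"\Z", text)]

# Analogue of [^\R]*(\R|$) with "$" read as "end of text".
line = r"[^\n]*(?:\n|\Z)"
two_lines = re.compile(f"(?:{line}){{2}}")

print(dollar_positions, end_positions)
print(two_lines.fullmatch(text) is not None)
print(two_lines.fullmatch(text + "\n") is not None)
```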

Instead, it is easier (not necessarily better!) if the format
author defines the last line to have an optional \R.

Still, complications arise from interactions with the record
readers.  They read a record at a time and pass the string
over to the parser.  The '$' will match at the end of that
string even though in the full format (non-record reader based)
it would not have matched.

After a bit of thought I realize that worry is a knee-jerk reaction.
It isn't a big concern, since similar problems exist already.  For
example, if the record parser uses "(.|\n)*" it will read up to the
end of the record, but the same expression in the full format
would read the whole file.
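
A small sketch of that effect, again with plain Python re instead of
Martel.  The "//" terminator, the sample text and the split-based
record reader are all invented for the illustration; the real record
readers work differently.

```python
import re

full_text = "ID 1\nDATA a\n//\nID 2\nDATA b\n//\n"

# Hypothetical record reader: cut the text after each "//" terminator line.
records = [chunk + "//\n" for chunk in full_text.split("//\n") if chunk]

# A greedy "match anything" expression stops wherever its input string stops.
greedy = re.compile(r"(.|\n)*")
end_in_record = greedy.match(records[0]).end()
end_in_file = greedy.match(full_text).end()

# An end-of-text assertion succeeds on the record but not on the full file.
first_record = re.compile(r"ID 1\nDATA a\n//\n\Z")
ok_in_record = first_record.match(records[0]) is not None
ok_in_file = first_record.match(full_text) is not None

print(end_in_record, end_in_file, ok_in_record, ok_in_file)
```

In both cases the same expression gives a different result depending on
whether it sees one record or the whole file.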

Another solution is to have a specialized ToEol (either a
new function or an optional argument) which generates the
"\R?" form.

Finally, I don't think this is much of an issue for real
formats.  All the ones I've tested so far have a final newline,
although I don't expect that to always be the case.  In
addition, the last line is usually well defined so a ToEol
(special or otherwise) isn't needed.  E.g., it can be defined
with Re(r"///\R?") or Re(r"END\R?").
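
As a sketch of that last point, here is a record whose final line is a
fixed "///" marker with an optional newline, written with Python's re
as a stand-in for Martel.  NL again approximates \R, and the sample
record lines are invented.

```python
import re

# Stand-in for Martel's \R (any newline); Python's re module has no \R.
NL = r"(?:\r\n|\r|\n)"

# Any number of ordinary lines (each ending in a newline, none starting
# with the marker), then the "///" marker with an optional newline.
record = re.compile(rf"(?:(?!///)[^\r\n]*{NL})*///{NL}?")

def parses(text):
    """Return True if the record pattern consumes the whole text."""
    return record.fullmatch(text) is not None

print(parses("ID x\nSQ acgt\n///\n"))   # marker followed by a newline
print(parses("ID x\nSQ acgt\n///"))     # marker is the last thing in the text
print(parses("ID x\nSQ acgt"))          # marker missing
```

Because the last line is spelled out explicitly, only the marker needs
the optional newline and the ordinary lines can stay strict.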

I'll point out that the record readers are designed so that
a final newline is not needed for the record.  Thus, any
problems with a missing newline should be fully manageable
with an appropriate format definition.

                    Andrew
                    dalke at acm.org




