solving the newline problem (was Re: [Biopython-dev] Martel-0.3 available)

Andrew Dalke dalke at acm.org
Sun Nov 19 04:21:02 EST 2000


Let me restore context first.  The question was how to handle different
newline conventions: native text files on the Mac use '\015', on unix
they use '\012', and on MS they use '\015\012'.  This convention is hidden
somewhat behind the C file I/O layer.  In text mode it translates the local
newline convention to the single character '\n', which in ASCII is
'\012'.  In binary mode the input character stream is not modified.

Martel uses '\n' as the end-of-line character and converts it to chr(10).
This requires that the input be in ASCII, which is a good assumption.  (I
don't expect to run Martel under an IBM 370 any time soon - that being
an EBCDIC machine :)

This means Martel should be able to run under any OS so long as the input
text data has been converted to use the local machine's line ending
convention and the file was opened in text mode, which is the default.
For example, FTP downloads must be done in ASCII mode instead of binary.

Networks make things more complicated.  For example, an http connection
only supports the equivalent of ftp's binary mode, meaning there is no way
to negotiate local newline conventions and automatically convert as needed.
Similarly,
files shared over NFS or SMB are not automatically converted.  (Samba
does have a flag to allow for automatic conversion, but I don't believe
it is used very often.)

On top of that, people are well known for being human - they are
inconsistent.  I had considered a wrapper which would read the first few
characters of a file to determine the newline convention and convert as
needed.  Some time ago Brad pointed out:
> There are times where people have generated files like this in my lab
> (the sequencer is running Windows, but they like to play around on
> the files on a Mac -- I still don't know how they got a mix of line
> breaks -- I think by cutting and pasting between files with different
> line breaks).

As another case, Roger Sayle pointed out to me yesterday that some of the
data files are made by concatenating other files.  For example, by merging
the gbpri* from GenBank into one file.  Suppose some of those files were
downloaded via FTP in ASCII mode and some in binary.  Then the newline
convention changes throughout the merged file.  Since this does happen,
it would be nice to handle this case gracefully.

Earlier I had outlined a few ways to solve the problem:
>    1) require the input to be converted to the local line ending and
>  provide no support for doing so

Not graceful.  No one likes this solution.

>    2) supply some adapters ("FromMac", "FromUnix", "FromDos") but don't
>  use them; instead leaving the decision up to the client code

As proposed, this wouldn't work because the line convention can change.
Instead, it would need to be a "FromAny" which would allow any of the
three endings.

>    3) provide a tool which autodetects endings and uses the right
>  adapter

My original thought was to read up to the first newline and use that
convention for the rest of the file.  Given the mixed-convention files
described above, this would not work.  Instead, the "FromAny" converter
would always have to check for all three endings.

>    4) http://members.nbci.com/_XOOM/meowing/python/index.html

I mentioned this library for two reasons.  First, I had heard it was
faster than Python's readline() method.  This is true, but it is almost
exactly as fast as Python's readlines(), which I had been using, so it
offers no performance benefit.

Second, I thought it allowed having all three of "\n", "\r" and "\r\n"
as the newline character.  After investigation I found out that it doesn't.
You can change the end-of-line marker but it must still be a single string.

It turns out that mxTextTools has a linesplit function which takes a
string and splits it into lines - allowing any of the three newline
conventions.
As it is written it is not appropriate for Martel because it strips out the
newlines.  A guarantee in Martel's design is that it must send all the
characters to the ContentHandler's "characters" method so they may be
counted.  This allows indexing by just counting the number of characters
which have gone through.  If the end-of-line characters are discarded,
it is impossible to know if the tossed text was one character or two.

The mxTextTools function is easy to reimplement, so this deficiency
is readily remedied.
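
To make this concrete, here is a rough sketch, in plain Python with the
re module rather than mxTextTools, of a splitter that keeps each line's
terminator attached so that every character still passes through:

import re

_line_pattern = re.compile(r"[^\r\n]*(?:\r\n|\r|\n|$)")

def split_keep_ends(text):
    # split into lines, keeping the "\n", "\r\n" or "\r" terminators
    lines = []
    pos = 0
    while pos < len(text):
        match = _line_pattern.match(text, pos)
        lines.append(match.group(0))
        pos = match.end()
    return lines

assert split_keep_ends("a\nb\r\nc\rd") == ["a\n", "b\r\n", "c\r", "d"]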

(An alternate solution is to add a tell method to the parser which
gets mapped down to the file handle's tell method.  This is a problem
because the readers are free to read ahead many characters for faster
reading.  When tell is called, it would have to figure out where the
callback is in the parsing.  This is complicated even more by text-mode
file handles on MS, where the file position reported by tell advances by
two for each "\r\n" even though only a single '\n' character is returned.)

>    5) define an EOL = Re(r"\n|\r\n?")
>
>  I don't like 5 because people will forget to use it.

Brad liked it because:
> 1. Easy to implement, and isn't very likely to break :-).
>
> 2. Provided the regexp would recognize Mac line breaks (hmmm, I'm not
> positive what those look like) then this could deal with files with
> multiple different types of line breaks without whining.

I ended up having a more serious problem with this option.  Martel allows
what I call "RecordReaders" that are really two parsers in one.  The first
does a simple scan of the input stream to identify records and the second
parses the records into SAX events.  Together they create the same SAX
events as a standard parser but use much less memory.  (They only need
enough memory to parse the most complex record, while the standard parsers
parse the whole file at once and so need roughly 10 times as much RAM as
the input data.)

The input data files are line oriented so my RecordReaders used the file's
"readlines" method with a sizehint to read a large but memory-bounded
number of lines, then scanned those lines to identify the records.  The
lines are joined back together into one string and parsed with the second
stage parser.
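
In outline, the scanning loop looks something like the following sketch
(made-up names, using the SWISS-PROT "//" terminator as the example; the
real RecordReaders do more bookkeeping than this):

import string

def scan_records(infile, record_found, sizehint = 1000000):
    # call record_found() with the full text of each record
    pending = []
    while 1:
        lines = infile.readlines(sizehint)
        if not lines:
            break
        for line in lines:
            pending.append(line)
            if line == "//\n":   # end-of-record marker
                record_found(string.join(pending, ""))
                pending = []
    if pending:
        record_found(string.join(pending, ""))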

This makes file reading about as fast as you can do with native Python.
However, readlines uses the local platform's definition of newline and
there is no way to support all three conventions.  If I had a Mac text
file, which uses '\015', and tried to read a line under unix, I would get
everything in the file as one line since there is no '\012' in the file.

So I'm left with the conclusion that I need to write a specialized reader
which understands all three line conventions, rather like the 'FromAny'
mentioned above.  Unlike mxTextTools' linesplit function it would need
to keep the end-of-line identifier.  Unlike my RecordReaders, it couldn't
use the readline or readlines methods but would have to call read directly.

Here's how the data would go through the system.  Create a file object
(open a file, use urllib to create a socket connection or use a StringIO).
Wrap it inside a FromAny object, which uses the file's read() method to
implement its own readlines() method supporting the different newline
conventions.  The RecordReader uses those lines to find the records, then
merges them back into one string for the record parser.
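
Here is a sketch of what that FromAny wrapper might look like (a
hypothetical class, never actually written); it assumes only that the
wrapped object has a read() method:

import re

_line_re = re.compile(r"[^\r\n]*(?:\r\n|\r|\n)")

class FromAny:
    def __init__(self, infile, blocksize = 65536):
        self.infile = infile
        self.blocksize = blocksize
        self.buffer = ""
        self.at_eof = 0

    def readlines(self, sizehint = 0):
        # fill the buffer, then split off complete lines, keeping the
        # "\n", "\r\n" or "\r" terminator on each line
        while not self.at_eof and (not sizehint or
                                   len(self.buffer) < sizehint):
            data = self.infile.read(self.blocksize)
            if not data:
                self.at_eof = 1
            self.buffer = self.buffer + data
        lines = []
        pos = 0
        while 1:
            m = _line_re.match(self.buffer, pos)
            if m is None:
                break
            line = m.group(0)
            # a '\r' at the very end of the buffer may be the first
            # half of a '\r\n' pair, so wait for more data
            if line[-1:] == '\r' and m.end() == len(self.buffer) \
               and not self.at_eof:
                break
            lines.append(line)
            pos = m.end()
        self.buffer = self.buffer[pos:]
        if self.at_eof and self.buffer:
            lines.append(self.buffer)   # final line with no newline
            self.buffer = ""
        return lines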

Very complicated, with lots of pure Python code to make things slow.
Hence, I didn't like it.

As I was looking through the QIO code I came up with an idea, which I
think ultimately arises from the bioperl list.  Bioperl's FASTA parser
works by setting $/ (Perl's input record separator) to "\n>".  This pushes
the problem of record identification onto Perl and greatly simplifies the
read loop.  The QIO interface would allow the same simplification, so
searching for a SWISS-PROT record could be turned into looking for the
string "\nID   ".

QIO doesn't support all three endings.  I could modify the code, but
then that would require (yet) another C extension.  We're already
including mxTextTools, which does text processing - why not use it?
That's when I dug through the module and found the 'linesplit' function,
which is written in pure Python using the taglist.

I hacked together some test code to try it out.  It is attached.  It
parses SWISS-PROT records by looking for lines matching "//" followed
by "\n", "\r\n" or "\r" and using them as end of record indicators.  After
some tweaking of the tatable to remove a subtable call, I found out it
was 15% *faster* than the readlines code.  (I haven't yet tested it on
MS to ensure it handles both text and binary reads, but it should. :)

It works on a large block of text at a time rather than splitting it
apart into lines.  The record parser uses a single block of text so
the current RecordReaders need to string.join the lines back into a
block.  This new approach only needs to use a single subslice to get
that text, so overall it should be a bit faster still.

WHAT DOES THIS GET US?

This new approach makes record identification much faster and allows
the record readers to work on files containing a mix of any of the three
standard line endings.  This means my objection to option 5 no longer
rests on parsing performance.

There are still some problems with usability.  In binary mode, or with
foreign text files, the parser can send back "\n", "\r\n" or "\r"
as the newline, so the format definitions must support all three.
Writing the newline as simply "\n" is insufficient.

For example, suppose you just want to read the text of the DE line
in SWISS-PROT.  The current format definition might be:

DE = Group("DE", Re("DE   (?P<description>[^\n]*)\n"))

This would have to be replaced with

DE = Group("DE", Re("DE   (?P<description>[^\n\r]*)(\n|\r\n?)"))

There are two changes: one for "description" from [^\n] to [^\n\r]
and the other from \n to (\n|\r\n?).
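
As a quick sanity check with Python's re module (the description text
here is made up), the revised pattern accepts all three conventions:

import re

pattern = re.compile(r"DE   (?P<description>[^\n\r]*)(\n|\r\n?)")
for line in ("DE   Example description.\n",     # unix
             "DE   Example description.\r\n",   # MS
             "DE   Example description.\r"):    # Mac
    m = pattern.match(line)
    assert m and m.group("description") == "Example description."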

These changes are simple mechanical transformations, but they are
sufficiently different from common regular expression practice that it
would be nice to automate them or otherwise remove the need for them.

I mentioned one possibility - define EOL = Re("\n|\r\n?").  Then the
DE format definition becomes:

DE = Group("DE", Re("DE   (?P<description>[^\n\r]*)") + EOL)

This is simpler to type and less error prone than using the full,
correct definition, but isn't as nice as "\n".  It isn't standard
so I think people will forget to put in the EOL in place of "\n".
Finally, it doesn't fix the need to use [^\n\r].

Here is a solution which appears to make the problem disappear.
If "\n" is ever found outside of a [] then replace it with "\n|\r\n?".
If it is ever found inside of a [], then also include "\r".

The problem is that it violates one of my basic design beliefs: things
which act differently should not look the same.  Other regular expression
engines do not perform this conversion, so I do not want to use it.

(Martel doesn't support backtracking inside of repeats.  You may
justifiably call that a violation of this belief.  On the other hand, any
solution which works in Martel should work using a normal regular
expression engine, so the implementation is really a subset of existing
behaviour and not a new behaviour.)

Here's another possibility.  There are still some letters unused as escape
sequences in both Perl and Python.  What about defining \R to mean
"platform-independent newline character"?  When used outside of []s it
gets turned into "\n|\r\n?" and when used inside of []s is the same as
[\r\n].  I chose \R because \N in perl is used for "named char".

Its use would change the DE definition from
  DE = Group("DE", Re("DE   (?P<description>[^\n]*)\n"))
to
  DE = Group("DE", Re("DE   (?P<description>[^\R]*)\R"))

It is still a non-standard definition, which means it isn't as nice as
I would like for it to be.  However, I haven't found any other regular
expression grammar which supports alternate newline conventions, so
there isn't really any standard to conform to.

The only time it would be used is in the Martel definition.  Converting
the Martel expression back to a regular expression pattern would use the
"\n|\r\n?" or "[\r\n]" descriptions, so the expression itself is still
standard; the \R is simply a shorthand notation, like \n is itself shorthand
for \012.

The conversion is mechanical and is in most cases a simple text
substitution.  That makes it easy to use, although its existence and
purpose would need to be carefully documented and enforced with social
pressure.  ("You *do* know that \n doesn't work as well as \R, right?")

In closing, I've come up with a way to increase parsing performance
which is platform independent and requires few changes to people's
understanding of regular expression syntax.  The first part
(increased performance) does not affect what I consider to be the stable
part of the API.  The second part does change things from their commonly
accepted use so I would like to hear any comments people may have about it.

                    Andrew
                    dalke at acm.org

P.S.
  In retrospect using mxTextTools for the record reading is obvious
and solves quite a few problems I was having.  I hate it when that happens
because it makes me feel dim-witted.  After all, I've been thinking about
this problem for a long time - why was I stuck in the old solution?  But
that's the way things go :)

P.P.S. - and irrelevant
  FSU (my alma mater) beat Florida and will likely be ranked as the number
2 college team.  Miami's complaining that they'll be #3 after FSU even
though they beat FSU.  Of course, they forget '89 when FSU beat Miami
but Miami was ranked #1.


-------------- next part --------------
#  % python read2.py
#  80000 records found with readlines
#  Time for readlines 118.07604301
#  80000 records found with find_record_ends
#  Time for tagtables 100.489358068
#  %

from Martel import Generate
from mx import TextTools as TT

tagtable = (
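    # Each entry is (tagobj, command, argument, jump-if-no-match,
    # jump-if-match); the jump values are offsets relative to the
    # current entry in the table.
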
    # Is the current line the end of record marker?
    (None, TT.Word, "//", +5, +1),

    # Make sure it ends the line
    ("end", TT.Is, '\n', +1, -1),  # matches '\n'
    (None, TT.Is, '\r', +3, +1),
    ("end", TT.Is, '\n', +1, -3),
    ("end", TT.Skip, 0, -4, -4),

    # Not the end of record marker, so read to the end of line
    (None, TT.AllInSet, TT.invset('\r\n'), +1, +1),

    # Check if EOF
    (None, TT.EOF, TT.Here, +1, TT.MatchOk),

    # Not EOF, so scarf any newlines
    (None, TT.AllInSet, TT.set('\r\n'), TT.MatchFail, -7),
    )

def find_record_ends(text):
    result, taglist, pos = TT.tag(text, tagtable)
    ends = []
    for tag in taglist:
        ends.append(tag[2])
    return ends

def test1():
    expect = (
        '//\n',
        'Andrew Dalke\n//\n',
        'was //\nhere\n//\n',
        '//\n'
        )
    text = "//\nAndrew Dalke\n//\nwas //\nhere\n//\n//\n"

    ends = find_record_ends(text)
    assert len(expect) == len(ends), (len(expect), len(ends))
    prev = 0
    for ex, end in map(None, expect, ends):
        s = text[prev:end]
        assert ex == s, (ex, s)
        prev = end
    print "expected lines found"

def test2():
    infile = open("/home/dalke/ftps/swissprot/sprot38.dat")
    s = ""
    count = 0
    while 1:
        data = infile.read(1000000)
        #print "Loop", count, len(s)
        if not data:
            break
        text = s + data
        ends = find_record_ends(text)
        if not ends:
            s = text
            continue
        # keep the unfinished record tail for the next pass
        s = text[ends[-1]:]
        count = count + len(ends)
    assert not s, "still have data: %s" % repr(s[:200])
    print count, "records found with find_record_ends"

def test3():
    infile = open("/home/dalke/ftps/swissprot/sprot38.dat")
    count = 0
    while 1:
        lines = infile.readlines(1000000)
        if not lines:
            break
        #print "Loop", count
        for line in lines:
            if line == "//\n":
                count = count + 1
    print count, "records found with readlines"
    
def do_time():
    import time

    t1 = time.time()
    test3()
    t2 = time.time()
    print "Time for readlines", t2-t1

    t1 = time.time()
    test2()
    t2 = time.time()
    print "Time for tagtables", t2-t1
    
if __name__ == "__main__":
    #test1()
    #test2()
    do_time()

