solving the newline problem (was Re: [Biopython-dev] Martel-0.3 available)

Mon Nov 20 01:01:10 EST 2000

Me:
>It works on a large block of text at a time rather than splitting them
>apart into lines.  The record parser uses a single block of text so
>the current RecordReaders need to string.join the lines back into a
>block.  This new approach only needs to use a single subslice to get
>that text, so overall it should be a bit faster still.

I've got a first pass at replacing the StartsWith RecordReader.  The
old reader (readlines and string.join) takes about 160 seconds to read
sprot38.dat while the new one takes about 90 seconds.  I also checked
and they return identical results.
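
For anyone who wants to see the idea, here is a minimal sketch of the
block-based approach in present-day Python.  It is not the Martel code:
the name read_records, the "ID   " start marker and the block size are
all assumptions made up for this illustration.  It reads large chunks
and cuts each record out with a single slice, instead of collecting
lines and joining them back together.

```python
def read_records(infile, start="ID   ", block_size=1 << 16):
    """Yield records; each new record begins at a line starting with `start`.

    Hypothetical helper for illustration only, not Martel's RecordReader.
    Anything before the first `start` line ends up in the first record.
    """
    marker = "\n" + start
    buf = infile.read(block_size)
    while buf:
        # Search from offset 1 so the record at the front of the buffer
        # is not mistaken for the start of the next one.
        pos = buf.find(marker, 1)
        if pos == -1:
            more = infile.read(block_size)
            if not more:
                # End of input: whatever is left is the last record.
                yield buf
                return
            buf += more
            continue
        # One slice gives the whole record, including its final newline.
        yield buf[:pos + 1]
        buf = buf[pos + 1:]
```

The sketch rescans the buffer after every refill, which is wasteful for
records much larger than the block size, and it only recognizes "\n"
line endings.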

>Here's another possibility.  There are still some letters unused as escape
>sequences in both Perl and Python.  What about defining \R to mean
>"platform-independent newline character"?  When used outside of []s it
>gets turned into "\n|\r\n?" and when used inside of []s is the same as
>[\r\n].  I chose \R because \N in Perl is used for "named char".

I've got a first pass at this as well.  sre_parse.py is very clean code
to modify.  The result seems to pass my regression tests.  I still need to
try it against real data on a non-Unix platform.
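
To make the intended expansion concrete, here is a rough sketch that
rewrites the pattern string before handing it to the standard re
module.  This is a different technique from patching sre_parse.py, and
the function name expand_newline_escape is hypothetical.  Outside of
[]s the alternation is wrapped in a non-capturing group so it cannot
leak into the surrounding pattern.

```python
import re

def expand_newline_escape(pattern):
    r"""Rewrite \R as a platform-independent newline (illustrative only)."""
    out = []
    in_class = False
    i = 0
    n = len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "\\" and i + 1 < n:
            nxt = pattern[i + 1]
            if nxt == "R":
                # Inside []s add both characters; outside, match \n, \r or \r\n.
                out.append(r"\r\n" if in_class else r"(?:\n|\r\n?)")
            else:
                # Copy every other escape (including \\) through unchanged.
                out.append(ch + nxt)
            i += 2
            continue
        if ch == "[" and not in_class:
            in_class = True
            out.append(ch)
            i += 1
            # A leading ^ and a leading ] are part of the class, not its end.
            if i < n and pattern[i] == "^":
                out.append("^")
                i += 1
            if i < n and pattern[i] == "]":
                out.append("]")
                i += 1
            continue
        if ch == "]" and in_class:
            in_class = False
        out.append(ch)
        i += 1
    return "".join(out)

# Usage: the same pattern accepts Unix, old Mac and DOS line endings.
line_end = re.compile(expand_newline_escape(r"//\R"))
```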

But that's all for the next day or so since I've got to get back to
paying work now.

                    Andrew
                    dalke at acm.org