solving the newline problem (was Re: [Biopython-dev] Martel-0.3 available)
Andrew Dalke
dalke at acm.org
Mon Nov 20 01:01:10 EST 2000
Me:
>It works on a large block of text at a time rather than splitting them
>apart into lines. The record parser uses a single block of text so
>the current RecordReaders need to string.join the lines back into a
>block. This new approach only needs to use a single subslice to get
>that text, so overall it should be a bit faster still.
I've got a first pass at replacing the StartsWith RecordReader. The
old reader (readlines and string.join) takes about 160 seconds to read
sprot38.dat, while the new one takes about 90 seconds. I also checked
that the two readers return identical results.
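To make the difference between the two readers concrete, here is a minimal sketch in present-day Python. It is not the Martel code: the function names, the `marker` argument and the `blocksize` default are all made up for illustration, and it assumes the input begins with the marker and that records start on lines beginning with it.

```python
import io


def read_records_by_lines(handle, marker):
    """Old style (hypothetical): gather lines, then join each record back up."""
    lines = []
    for line in handle:
        if line.startswith(marker) and lines:
            yield "".join(lines)
            lines = []
        lines.append(line)
    if lines:
        yield "".join(lines)


def read_records_by_block(handle, marker, blocksize=65536):
    """New style (hypothetical): read big blocks and slice each record out."""
    sep = "\n" + marker          # a record starts right after this newline
    buf = ""
    while True:
        chunk = handle.read(blocksize)
        if not chunk:
            break
        buf += chunk
        start = 0
        while True:
            pos = buf.find(sep, start)
            if pos == -1:
                break
            yield buf[start:pos + 1]   # one slice, newline included
            start = pos + 1
        buf = buf[start:]              # keep the unfinished record for next pass
    if buf:
        yield buf


data = "ID A\nline\n//\nID B\nx\n//\nID C\n"
by_lines = list(read_records_by_lines(io.StringIO(data), "ID"))
by_block = list(read_records_by_block(io.StringIO(data), "ID", blocksize=3))
```

The tiny `blocksize` in the usage lines is only there to exercise the case where a record boundary straddles two reads; a real reader would use a much larger block.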
>Here's another possibility. There are still some letters unused as escape
>sequences in both Perl and Python. What about defining \R to mean
>"platform-independent newline character"? When used outside of []s it
>gets turned into "\n|\r\n?" and when used inside of []s is the same as
>[\r\n]. I chose \R because \N in perl is used for "named char".
I've got a first pass at this as well. sre_parse.py is very clean code
to modify. The result seems to pass my regression tests. I still need to
try it against real data on a non-Unix platform.
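The \R idea can also be tried out without touching sre_parse.py, by rewriting the pattern string before it is compiled. The sketch below does that; it is not the sre_parse.py change described above, and `expand_newline_escape` is a hypothetical helper name. It wraps the outside-of-[]s alternation in a non-capturing group so that "|" does not swallow the rest of the pattern, and it handles only the simple character-class cases.

```python
import re


def expand_newline_escape(pattern):
    """Rewrite \\R in a regex string (hypothetical helper, simplified parsing)."""
    out = []
    in_class = False
    i = 0
    n = len(pattern)
    while i < n:
        c = pattern[i]
        if c == "\\" and i + 1 < n:
            nxt = pattern[i + 1]
            if nxt == "R":
                # Inside []s: just the two characters. Outside: the alternation.
                out.append(r"\r\n" if in_class else r"(?:\n|\r\n?)")
            else:
                out.append(c + nxt)    # leave every other escape untouched
            i += 2
            continue
        if c == "[" and not in_class:
            in_class = True
            out.append(c)
            i += 1
            # A "^" and/or "]" right after the opening "[" is part of the class.
            if i < n and pattern[i] == "^":
                out.append("^")
                i += 1
            if i < n and pattern[i] == "]":
                out.append("]")
                i += 1
            continue
        if c == "]" and in_class:
            in_class = False
        out.append(c)
        i += 1
    return "".join(out)


text = "a\r\nb\rc\nd"                      # DOS, old Mac and Unix line endings
parts = re.split(expand_newline_escape(r"\R"), text)
words = re.findall(expand_newline_escape(r"[^\R]+"), text)
```

Both `parts` and `words` come out as the four letters, whichever newline convention separated them.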
But that's all for the next day or so since I've got to get back to
paying work now.
Andrew
dalke at acm.org