[Biopython-dev] Reading sequences: FormatIO, SeqIO, etc

Tue Aug 1 20:53:08 UTC 2006

Peter wrote:
>>> The SeqUtils/quick_FASTA_reader is interesting in that it loads the
>>> entire file into memory in one go, and then parses it.  On the other
>>> hand it's not perfect: I would use "\n>" as the split marker
>>> rather than ">" which could appear in the description of a sequence.

Leighton Pritchard replied:
>> I agree (not that it's bitten me, yet), but I'd be inclined to go  
>> with "%s>" % os.linesep as the split marker, just in case.

Peter then wrote:
> Good point.

I take that back - I was right the first time ;)

You are right to worry about the line separator changing from platform to
platform, but you shouldn't use "%s>" % os.linesep.

On Windows, os.linesep is "\r\n" (CR LF), which is what gets stored on
disk. However, when reading Windows style files on Windows in text mode,
the newlines appear in Python as just \n (as do newlines from Unix files
read on Windows).

When writing text files on Windows, again \n gets turned into CR LF on
the disk.
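
A minimal sketch of the translation described above, written for Python 3
(an assumption; this thread is about Python 2.3). The explicit
newline="\r\n" argument stands in for what text mode does by default on
Windows, so the sketch behaves the same on any platform. The file name
demo.txt is hypothetical.

```python
import os
import tempfile

# Hypothetical scratch file; the name is illustrative only.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# newline="\r\n" mimics Windows text mode: each "\n" written becomes CR LF on disk.
with open(path, "w", newline="\r\n") as handle:
    handle.write(">seq1\nACGT\n")

# Binary mode shows the bytes actually stored in the file.
with open(path, "rb") as handle:
    on_disk = handle.read()

# Default text mode (newline=None) translates CR LF back to "\n" when reading.
with open(path, "r") as handle:
    in_python = handle.read()

print(repr(on_disk))
print(repr(in_python))
```

The string seen in Python never contains the CR, which is why a split
marker built from os.linesep would not match it.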

Just using "\n>" would work on any platform reading a FASTA file with
the expected newlines.  As a bonus it would work on Windows when reading
Unix style newlines.

To get any platform to read newlines from any other platform, I suggest
using "\n>" as the split string but opening the file in universal newline
mode ("rU").  This seems to work fine on Python 2.3, but I'm not sure when
universal newline reading was introduced.
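
The suggestion above can be sketched roughly as follows. This is only an
illustration under stated assumptions: it targets Python 3, where
newline=None (the default for text mode) plays the role of the "rU" mode
used in this thread, and split_fasta_records is a hypothetical helper name,
not part of Biopython or SeqUtils.

```python
import os
import tempfile


def split_fasta_records(path):
    """Return the records of a FASTA file as strings, each starting with '>'.

    Hypothetical helper for illustration; not part of Biopython.
    """
    # newline=None turns \r\n and \r into \n while reading (universal newlines).
    with open(path, "r", newline=None) as handle:
        text = handle.read()
    # Prefix "\n" so the first record's ">" is also matched by the "\n>" marker.
    chunks = ("\n" + text).split("\n>")
    # chunks[0] is whatever came before the first record (normally empty).
    return [">" + chunk.rstrip("\n") for chunk in chunks[1:]]


# Write two records using old Mac style (CR only) line endings.
path = os.path.join(tempfile.mkdtemp(), "mac.fasta")
with open(path, "wb") as handle:
    handle.write(b">seq1 a > b\rACGT\r>seq2\rGGCC\r")

records = split_fasta_records(path)
print(records)
```

Note that the ">" inside the first description line does not cause a
split, because it is not preceded by a newline.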

For example, I created a simple file in each of the three newline
conventions (using TextPad on Windows).

>>> import os
>>> import sys
>>> sys.platform
'win32'
>>> os.linesep
'\r\n'

>>> open("c:/temp/windows.txt","r").read()
'line\nline\n'
>>> open("c:/temp/mac.txt","r").read()
'line\rline\r'
>>> open("c:/temp/unix.txt","r").read()
'line\nline\n'

(Notice that splitting on "\n>" wouldn't work when reading a Mac style file
on Windows in plain "r" mode, because the \r newlines are left untouched.)

>>> open("c:/temp/windows.txt","rU").read()
'line\nline\n'
>>> open("c:/temp/mac.txt","rU").read()
'line\nline\n'
>>> open("c:/temp/unix.txt","rU").read()
'line\nline\n'

Peter