[BioPython] Whitespace in sequences

Brad Chapman chapmanb at arches.uga.edu
Wed Feb 19 18:25:25 EST 2003


Hey Paul-Michael, Iddo;
Thirteen hour time difference here from EST, so we are just waking
up (and man, does my head hurt; no comment).

Paul-Michael;
 > > While recently writing a biopython script to extract subsequences from
 > > a fasta file, I was surprised to find that whitespace was retained
 > > within the sequence after it was read into a SeqRecord. Specifically,
 > > carriage returns ('\r') were left embedded in the sequence, which then
 > > made the sequence lengths inaccurate and meant I extracted the wrong
 > > regions.

 > I guess you were using biopython on a Mac/Windows box, where '\r' or
 > '\r\n' is a newline.

The problem is likely that you are using Mac-generated files (which
have '\r' as the line separator character) on a non-Mac box. In
Python, '\n' written to a text-mode file translates to the native line
ending, but if you are reading files with non-native line endings then
all is up in the air.

The simple solution to this problem, which is really the solution to
all cross-platform line ending problems in my opinion, is to
transfer the files to your box "correctly." If you transfer the
files as text with an ftp program (or with one of the ssh-based
programs that support text transfers) then the line endings should
be fixed and all should be good.

Different line endings suck and are always a pain, but this is
really the simplest way to make your life less of a hassle.
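
If re-transferring the files is not an option, roughly the same
conversion can be done locally. The following is only a minimal sketch
in present-day Python; the function name normalize_newlines is made up
for illustration and is not part of Biopython.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert '\r\n' and lone '\r' line endings to '\n'.

    Hypothetical helper, not a Biopython function. It works on raw
    bytes so that no platform-specific translation gets in the way.
    """
    # Handle the two-character ending first so that a '\r\n' pair
    # does not turn into two newlines.
    data = data.replace(b"\r\n", b"\n")
    return data.replace(b"\r", b"\n")


# A record whose lines use three different endings.
mixed = b">seq1\r\nACGT\rTTGA\n"
cleaned = normalize_newlines(mixed)
```

To fix a file in place, read it with open(path, "rb"), pass the bytes
through the helper, and write the result back with open(path, "wb").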

 > Also, it looks like you were using the Bio.Fasta package to
 > read... the bug shouldn't occur within Bio.SeqIO.FASTA.FastaReader
 > (although it will within SeqIO.FASTA.FastaWriter!)

Yes, Martel-based parsers handle this correctly as they have an
AnyEol() function which matches any possible end of line character.
So conceivably you can have files with different line endings
throughout and all will be good.
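
To illustrate the idea of matching any end of line (this is not
Martel's actual implementation, just a sketch with the standard re
module), a pattern can accept '\r\n', '\r' or '\n' interchangeably.
The names ANY_EOL and parse_single_fasta are hypothetical, and the
function assumes a single record whose first line starts with '>'.

```python
import re

# Try '\r\n' first so the pair counts as one line ending, not two.
ANY_EOL = re.compile(r"\r\n|\r|\n")


def parse_single_fasta(text):
    """Return (title, sequence) for one FASTA record.

    Illustrative only: assumes exactly one record and that the first
    non-empty line is the '>' title line.
    """
    lines = [line for line in ANY_EOL.split(text) if line]
    title = lines[0][1:]          # drop the leading '>'
    sequence = "".join(lines[1:])  # no line-ending characters survive
    return title, sequence


# Every line here ends differently, yet the sequence comes out clean.
title, sequence = parse_single_fasta(">demo\r\nACGT\rTTGA\nCC\r\n")
```

With this input the sequence has no embedded '\r', so its length
matches the number of residues.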

 > Basically, all occurrences of the Linux/Unix-centric '\n' should be
 > replaced with os.linesep. In all modules.
 > (a few minutes later)

As I mentioned, '\n' isn't really unix-centric. When written in text
mode it does mean '\r' on Macs and '\r\n' on Windows machines. Again,
line endings suck.
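
The translation of '\n' on output can be seen directly. This is a small
sketch in present-day Python (the newline argument to open() postdates
this thread), and written_bytes is a hypothetical helper name. It
writes text in text mode and then reads back the raw bytes.

```python
import os
import tempfile


def written_bytes(text, newline=None):
    """Write text in text mode and return the raw bytes on disk."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w", newline=newline) as handle:
            handle.write(text)
        with open(path, "rb") as handle:
            return handle.read()
    finally:
        os.remove(path)


# Default: each '\n' is written as os.linesep for this machine.
native = written_bytes(">seq1\nACGT\n")

# Forcing '\r\n' shows what a platform with that ending would write.
forced = written_bytes(">seq1\nACGT\n", newline="\r\n")
```

So code that writes '\n' in text mode already produces native endings,
and substituting os.linesep there would double up the translation on
platforms where the two differ.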


 > Hmmm... sorry, but I can't seem to commit the bugfix, probably something
 > to do with snow in Boston, or a Hackathon in Singapore. Take your pick. :)

Seems you got it Iddo, but pub.open-bio.org is the new dev server --
convenient crashing sidelined the other machine (although maybe it
is fixed now since your fixes did go through).

Did-I-mention-line-endings-suck-ly yr's,
Brad


