[BioPython] Whitespace in sequences
Brad Chapman
chapmanb at arches.uga.edu
Wed Feb 19 18:25:25 EST 2003
Hey Paul-Michael, Iddo;
Thirteen hour time difference here from EST, so we are just waking
up (and man, does my head hurt; no comment).
Paul-Michael;
> > While recently writing a biopython script to extract subsequences from
> > a fasta file, I was surprised to find that whitespace was retained
> > within the sequence after it was read into a SeqRecord. Specifically,
> > carriage returns ('\r') were left embedded in the sequence, which then
> > made the sequence lengths inaccurate and meant I extracted the wrong
> > regions.
> I guess you were using biopython on a Mac/Windows box, where '\r' or
> '\r\n' is a newline.
The problem is likely that you are using files generated on another
platform (Windows uses '\r\n' and classic Mac OS uses '\r' as the line
separator) on a box with a different native line ending. In Python,
'\n' written to a file opened in text mode is translated to the native
line ending, but if you are reading files with non-native line endings
then all is up in the air.
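
To make the symptom concrete, here is a minimal sketch of how a stray
'\r' ends up inside a sequence. This is not Biopython's parser;
naive_fasta_sequence is a hypothetical helper that assumes every line
ends in a bare '\n', which is the assumption that breaks on '\r\n' files.

```python
def naive_fasta_sequence(raw):
    # Hypothetical helper: split on '\n' only, as code written for
    # Unix-style files might, then join everything after the title line.
    lines = raw.split("\n")
    return "".join(lines[1:])  # lines[0] is the '>' title line


# A two-line record whose lines end in '\r\n'
record = ">seq1\r\nACGT\r\nTTGA\r\n"
seq = naive_fasta_sequence(record)
print(repr(seq), len(seq))  # 'ACGT\rTTGA\r' 10
```

The real sequence has 8 residues, but the two embedded carriage returns
push the length to 10, so any slice taken by position is shifted.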
The simple solution to this problem, which is really the solution to
all cross-platform line ending problems in my opinion, is to
transfer the files to your box "correctly." If you transfer the
files with an ftp program in text (ASCII) mode, or with one of the
ssh-based programs that can transfer as text, then the line endings
should be converted and all should be good.
Different line endings suck and are always a pain, but this is
really the simplest way to make your life less of a hassle.
> Also, it looks like you were using the Bio.Fasta package to
> read... the bug shouldn't occur within Bio.SeqIO.FASTA.FastaReader
> (although it will within SeqIO.FASTA.FastaWriter!)
Yes, Martel-based parsers handle this correctly as they have an
AnyEol() function which matches any possible end of line character.
So conceivably you can have files with different line endings
throughout and all will be good.
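
The idea of matching any end of line can be sketched with a plain
regular expression. This is only an illustration of the approach, not
Martel's actual implementation; ANY_EOL and split_any_eol are names made
up for the example.

```python
import re

# Try '\r\n' first so a Windows ending is not counted as two line breaks.
ANY_EOL = re.compile(r"\r\n|\r|\n")


def split_any_eol(raw):
    # Split on any of the three endings and drop empty strings
    # (blank lines and the empty tail left after a final line ending).
    return [line for line in ANY_EOL.split(raw) if line]


# One record with three different line endings mixed together
mixed = ">seq1\r\nACGT\rTTGA\nCC\r\n"
print(split_any_eol(mixed))  # ['>seq1', 'ACGT', 'TTGA', 'CC']
```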
> Basically, all occurrences of the Linux/Unix-centric '\n' should be
> replaced with os.linesep. In all modules.
> (a few minutes later)
As I mentioned, '\n' isn't really unix-centric. Written in text mode,
it comes out as '\r' on (classic) Macs and '\r\n' on Windows machines.
Again, line endings suck.
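
A small sketch of that translation, assuming a current Python where
open() in text mode converts '\n' to os.linesep on write by default.
bytes_written_in_text_mode is a hypothetical helper name.

```python
import os
import tempfile


def bytes_written_in_text_mode(text):
    # Write text through a text-mode file, then read back the raw bytes
    # to see what actually landed on disk.
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w") as handle:  # default newline translation
            handle.write(text)
        with open(path, "rb") as handle:
            return handle.read()
    finally:
        os.remove(path)


raw = bytes_written_in_text_mode("ACGT\nTTGA\n")
# Each '\n' has been replaced by the platform's native line ending.
print(raw == ("ACGT" + os.linesep + "TTGA" + os.linesep).encode())  # True
```

This is also why swapping '\n' for os.linesep in text-mode writes does
not help: on Windows the '\n' inside '\r\n' is translated again and the
file ends up with '\r\r\n'.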
> Hmmm... sorry, but I can't seem to commit the bugfix, probably something
> to do with snow in Boston, or a Hackathon in Singapore. Take your pick. :)
Seems you got it Iddo, but pub.open-bio.org is the new dev server --
convenient crashing sidelined the other machine (although maybe it
is fixed now since your fixes did go through).
Did-I-mention-line-endings-suck-ly yr's,
Brad