[Biopython-dev] [Biopython] SeqIO.index improvement suggestions

Renato Alves rjalves at igc.gulbenkian.pt
Wed Feb 24 11:07:01 EST 2010



Quoting Peter on 02/24/2010 11:52 AM:

> Maybe this mode isn't fully supported in gzip? I think that provided we
> assume that any gzipped text file will use Unix new lines, we don't need
> to worry about this.

Your example puzzled me, so I ran a few more tests with the files you
pointed out. It turns out that the FASTQ file is read 'badly' even by the
plain built-in open() in universal newlines mode ('rU'). This doesn't
happen with the other files:

Python 2.6.4 [GCC 4.4.1] Linux

>>> open('example.fastq.gz', 'rb').read() == open('example.fastq.gz', 'rU').read()
False
>>> open('example.fasta.gz', 'rb').read() == open('example.fasta.gz', 'rU').read()
True
>>> open('example.qual.gz', 'rb').read() == open('example.qual.gz', 'rU').read()
True

In particular, the character at fault seems to be:

>>> (open('example.fastq.gz', 'rb').read()[145],
...  open('example.fastq.gz', 'rU').read()[145])
('\r', '\n')

This is the only thing that changed.
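
The comparison above can be narrowed down to the exact offsets that differ.
A minimal sketch, assuming Python 3 (where the 'rU' mode is not available in
recent releases, so the translation is imitated by hand). The helper names
translate_universal and diff_offsets are made up for illustration and are
not part of Python or Biopython, and the sample bytes are a stand-in, not
the real file:

```python
def translate_universal(data):
    # Imitate universal-newline translation on raw bytes:
    # CR LF pairs and lone CR bytes both become LF.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def diff_offsets(a, b):
    # Return the offsets at which two equal-length byte strings differ.
    if len(a) != len(b):
        raise ValueError("inputs differ in length; offsets would not line up")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Hypothetical stand-in for compressed data that happens to contain a CR byte.
raw = b"\x1f\x8b\x08\x00ab\rcd"
translated = translate_universal(raw)
offsets = diff_offsets(raw, translated)
print(offsets)
```

A single differing offset whose byte is '\r' in the binary read, as at
position 145 above, would be consistent with the translation being applied
to the compressed bytes rather than to the decompressed text.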

After looking through the content of the file a little, I found this workaround:

$ gunzip example.fastq.gz && echo >> example.fastq && gzip example.fastq

This simply appends an empty line to the end of the uncompressed file
before compressing it again.

>>> open('example.fastq.gz', 'rb').read() == open('example.fastq.gz', 'rU').read()
True

After this I also tried Python 3 (3.1.1), just in case it had already been
fixed there, and apparently it has. See for yourself.

This was tested with Python 3.1.1 from within Blender 2.5 (apologies for
that; it was the only Python 3 version I had around).

>>> open('example.fastq.gz','rb').read() == open('example.fastq.gz','rU').read()
Traceback (most recent call last):
(...)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1:
unexpected code byte

Seems like I need to force binary mode...

>>> open('example.fastq.gz','rb').read() == open('example.fastq.gz','rbU').read()
True

Success!
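
The UnicodeDecodeError above names byte 0x8b at position 1, which matches
the second byte of the gzip magic number (0x1f 0x8b), so the text-mode open
was trying to decode compressed data. A small sketch of checking for that
signature before deciding how to open a file; looks_gzipped is a
hypothetical helper name, not a standard library function, and the record
is a placeholder:

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of a gzip stream


def looks_gzipped(data):
    # True if the byte string starts with the gzip magic number.
    return data[:2] == GZIP_MAGIC


# Build a small gzip stream in memory rather than relying on a file on disk.
buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode="wb") as handle:
    handle.write(b"@read1\nACGT\n+\n;;;;\n")
compressed = buffer.getvalue()

print(looks_gzipped(compressed))
print(looks_gzipped(b"@read1\nACGT\n"))
```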

>>> import gzip
>>> gzip.open('example.fastq.gz','rb').read()
b'@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;393333\n'

>>> gzip.open('example.fastq.gz','rU').read()
b'@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;393333\n'

>>> gzip.open('example.fastq.gz','rbU').read()
b'@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;393333\n'

And everything works as expected.
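
For reading the decompressed content as text with newline translation,
later Python 3 releases let gzip.open do both steps. A minimal sketch,
assuming Python 3.3 or later (where gzip.open accepts text modes and a
newline argument); the temporary file and the FASTQ record are placeholders,
not the example files from this thread:

```python
import gzip
import os
import tempfile

# Placeholder record written with Windows-style (CR LF) line endings.
record = b"@read1\r\nACGT\r\n+\r\n;;;;\r\n"

path = os.path.join(tempfile.mkdtemp(), "example.fastq.gz")
with gzip.open(path, "wb") as handle:
    handle.write(record)

# "rt" decompresses first and then decodes; newline=None applies
# universal-newline translation to the decompressed text only.
with gzip.open(path, "rt", encoding="ascii", newline=None) as handle:
    lines = handle.readlines()

print(lines)
```

Because the translation happens after decompression, bytes inside the
compressed stream are never touched.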

So unless the Blender devs patched Python to fix this bug, it has been
fixed in Python 3.

Should this go upstream?

--
Renato

