[Biopython-dev] [Biopython] SeqIO.index improvement suggestions
Peter
biopython at maubp.freeserve.co.uk
Wed Feb 24 06:52:55 EST 2010
On Tue, Dec 22, 2009 at 4:08 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> The gzip mode issue is interesting... running on the Mac,
> Leopard 10.5, using the Apple provided Python 2.5.2,
> looking at a gzipped QUAL file everything is fine:
>
> Python 2.5.2 (r252:60911, Feb 22 2008, 07:57:53)
> [GCC 4.0.1 (Apple Computer, Inc. build 5363)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import gzip
> >>> gzip.open("Quality/example.qual.gz", "r").read()
> ...
>
> Looking at a gzipped FASTA file everything is fine:
> ...
>
> But, there is a problem with my gzipped FASTQ file:
>
> >>> gzip.open("Quality/example.fastq.gz", "r").read()
> '@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;393333\n'
> >>> gzip.open("Quality/example.fastq.gz", "rb").read()
> '@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;393333\n'
> >>> gzip.open("Quality/example.fastq.gz", "rU").read()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/gzip.py", line 220, in read
>     self._read(readsize)
>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/gzip.py", line 292, in _read
>     self._read_eof()
>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/gzip.py", line 311, in _read_eof
>     raise IOError, "CRC check failed"
> IOError: CRC check failed
>
> I may have stumbled on a bug in the Python gzip library :(
>
Prompted by a thread on the BioPerl mailing list, I revisited this issue:
http://lists.open-bio.org/pipermail/bioperl-l/2010-February/032359.html
From some cross-platform testing, I always seem to get the CRC error
when trying to open this gzipped FASTQ file in universal newlines read
mode ("rU"). The FASTA and QUAL files seem fine.
According to the gzip python module's documentation, it uses the zlib
module, and you can find the underlying version number like this:
>>> import zlib
>>> zlib.ZLIB_VERSION
'1.2.3'
Results from testing the simple examples above (using Python and the
gzip module only):
[1] Mac OS X 10.5, Python 2.5.2, GCC 4.0.1, zlib 1.2.3 - fails
[2] Linux, Python 2.4.3, GCC 3.4.5, zlib 1.2.1.2 - fails
[3] Linux, Python 2.3.4, GCC 3.4.6, zlib 1.2.1.2 - fails
[3] Linux, Python 2.6.1, GCC 3.4.6, zlib 1.2.1.2 - fails
[4] Linux, Python 2.4.3, GCC 4.1.2, zlib 1.2.3 - fails
[4] Linux, Python 2.6.1, GCC 3.4.6, zlib 1.2.3 - fails
[5] Windows XP 32bit, Python 2.7a1, MSC v.1500, zlib 1.2.3 - fails
[5] Windows XP 32bit, Python 2.6, MSC v.1500, zlib 1.2.3 - fails
[5] Windows XP 32bit, Python 2.5.2, MSC v.1310, zlib 1.2.3 - fails
[5] Windows XP 32bit, Python 2.4.4, MSC v.1310, zlib 1.2.3 - fails
[5] Windows XP 32bit, Python 2.3.5, MSC v.1200, zlib 1.1.4 - fails
[1] My Mac, [2] Local server, [3] Cluster head, [4] Cluster node,
[5] My Windows box
This tells me that the failure isn't OS specific, and isn't specific to
a particular version of Python or zlib. Note that on the Mac and Linux
machines where I get the CRC failure in Python, the command line tool
gunzip can decompress the files fine.
If anyone else wants to test this (to confirm I'm not missing anything
obvious), you can download the gzipped files from github here:
wget http://github.com/peterjc/biopython/raw/index-zip/Tests/Quality/example.qual.gz
wget http://github.com/peterjc/biopython/raw/index-zip/Tests/Quality/example.fasta.gz
wget http://github.com/peterjc/biopython/raw/index-zip/Tests/Quality/example.fastq.gz
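For anyone repeating the test, here is a minimal sketch of a script that
prints one environment line in the style of the results list above and
then tries each read mode in turn. The helper names environment_summary
and try_modes, and the small stand-in file demo.fastq.gz, are
illustrative assumptions and not part of Biopython or the original test;
point path at the downloaded example.fastq.gz to check the real file.

```python
import gzip
import os
import platform
import tempfile
import zlib


def environment_summary():
    """One line in the style of the results list: OS, Python, compiler, zlib."""
    return "%s, Python %s, %s, zlib %s" % (
        platform.system(),
        platform.python_version(),
        platform.python_compiler(),
        zlib.ZLIB_VERSION,
    )


def try_modes(path, modes=("r", "rb", "rU")):
    """Attempt a full read in each mode; record 'ok' or the exception name."""
    results = {}
    for mode in modes:
        try:
            handle = gzip.open(path, mode)
            try:
                handle.read()
            finally:
                handle.close()
            results[mode] = "ok"
        except Exception as err:  # e.g. a CRC IOError, or a rejected mode
            results[mode] = "fails (%s)" % type(err).__name__
    return results


# Stand-in data (hypothetical) so the sketch runs without any download.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.fastq.gz")
handle = gzip.open(path, "wb")
handle.write(b"@demo\nACGT\n+\n;;;;\n")
handle.close()

print(environment_summary())
mode_results = try_modes(path)
for mode in ("r", "rb", "rU"):
    print("%-2s - %s" % (mode, mode_results[mode]))
```

The script only records which exception type was raised, so a mode that
the installed gzip module refuses outright shows up differently from a
CRC failure during the read.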
Maybe universal newlines mode isn't fully supported in gzip? Provided we
assume that any gzipped text file will use Unix newlines, I think we can
simply avoid "rU" and not worry about this.
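As a rough illustration of that assumption, the sketch below reads a
gzipped file in binary mode (which worked in the sessions quoted above)
and splits on Unix newlines itself. The function name read_gzipped_lines
and the stand-in file are hypothetical and this is not how SeqIO does it;
a file with Windows line endings would keep a trailing carriage return on
each line.

```python
import gzip
import os
import tempfile


def read_gzipped_lines(path):
    """Read a gzipped text file in binary mode and split on LF only.

    Assumes Unix (LF) line endings, as discussed above; no universal
    newline translation is attempted.
    """
    handle = gzip.open(path, "rb")
    try:
        data = handle.read()
    finally:
        handle.close()
    lines = data.split(b"\n")
    # A file ending in a newline yields one empty trailing item; drop it.
    if lines and lines[-1] == b"":
        lines.pop()
    return lines


# Stand-in data (hypothetical) so the sketch runs on its own.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.fastq.gz")
handle = gzip.open(path, "wb")
handle.write(b"@demo\nACGT\n+\n;;;;\n")
handle.close()

lines = read_gzipped_lines(path)
print(lines)
```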
Peter