[Biopython-dev] [Biopython] SeqIO.index improvement suggestions

Peter biopython at maubp.freeserve.co.uk
Tue Dec 22 16:08:50 UTC 2009


On Tue, Dec 22, 2009 at 3:34 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> Points to note, this is untested on Windows. In particular we need
> to look at gzipped plain text files using DOS/Windows new lines
> (rare case?) plus gzipped plain text files using Unix new lines
> (likely to be the more common of the two I'd expect). From my
> initial checks, while gzip.open() does take a mode argument it
> doesn't seem to support the "rU" value for universal new line
> read mode. This spoils my plan to give the open_function both
> the filename and the desired mode (generally "rU", but for SFF
> files etc we will want to use "rb").

The gzip mode issue is interesting. Running on Mac OS X Leopard
(10.5) with the Apple provided Python 2.5.2, a gzipped QUAL file
reads fine in all three modes ("r", "rb" and "rU"):

Python 2.5.2 (r252:60911, Feb 22 2008, 07:57:53)
[GCC 4.0.1 (Apple Computer, Inc. build 5363)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import gzip
>>> gzip.open("Quality/example.qual.gz", "r").read()
'>EAS54_6_R1_2_1_413_324\n26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26\n26 26 26 23 23\n>EAS54_6_R1_2_1_540_792\n26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26\n26 18 26 23 18\n>EAS54_6_R1_2_1_443_348\n26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18\n24 18 18 18 18\n'
>>> gzip.open("Quality/example.qual.gz", "rb").read()
'>EAS54_6_R1_2_1_413_324\n26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26\n26 26 26 23 23\n>EAS54_6_R1_2_1_540_792\n26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26\n26 18 26 23 18\n>EAS54_6_R1_2_1_443_348\n26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18\n24 18 18 18 18\n'
>>> gzip.open("Quality/example.qual.gz", "rU").read()
'>EAS54_6_R1_2_1_413_324\n26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26\n26 26 26 23 23\n>EAS54_6_R1_2_1_540_792\n26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26\n26 18 26 23 18\n>EAS54_6_R1_2_1_443_348\n26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18\n24 18 18 18 18\n'

A gzipped FASTA file also reads fine in all three modes:

>>> gzip.open("Quality/example.fasta.gz", "r").read()
'>EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n>EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n>EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n'
>>> gzip.open("Quality/example.fasta.gz", "rb").read()
'>EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n>EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n>EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n'
>>> gzip.open("Quality/example.fasta.gz", "rU").read()
'>EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n>EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n>EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n'

But there is a problem with my gzipped FASTQ file. Modes "r" and
"rb" work, while "rU" fails with a CRC error:

>>> gzip.open("Quality/example.fastq.gz", "r").read()
'@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;393333\n'
>>> gzip.open("Quality/example.fastq.gz", "rb").read()
'@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;393333\n'
>>> gzip.open("Quality/example.fastq.gz", "rU").read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/gzip.py", line 220, in read
    self._read(readsize)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/gzip.py", line 292, in _read
    self._read_eof()
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/gzip.py", line 311, in _read_eof
    raise IOError, "CRC check failed"
IOError: CRC check failed

I may have stumbled on a bug in the Python gzip library :(
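
One way to sidestep the problem would be to never hand the "U" flag
to gzip.open() at all: always open the gzipped file as "rb", and do
any universal new line translation on the decompressed data. The
following is only a rough sketch of that idea, not tested against
SeqIO.index. It assumes a modern Python 3 (where io.TextIOWrapper can
wrap a gzip handle), the name gzip_open_function is hypothetical, and
the ASCII encoding is an assumption that suits FASTA/FASTQ/QUAL files.

```python
import gzip
import io


def gzip_open_function(filename, mode="rU"):
    """Hypothetical open_function which never passes "U" to gzip.open()."""
    # Always read the compressed file itself in binary mode.
    binary_handle = gzip.open(filename, "rb")
    if "U" in mode:
        # newline=None enables universal new line translation, applied
        # to the decompressed text rather than to the compressed bytes.
        # The ASCII encoding is an assumption for plain sequence files.
        return io.TextIOWrapper(binary_handle, encoding="ascii", newline=None)
    # Binary formats (e.g. SFF) get the raw decompressed bytes.
    return binary_handle
```

With something like this, gzip_open_function(filename, "rU") would
return text with "\n" line endings for both Unix and DOS/Windows
style files, while gzip_open_function(filename, "rb") would leave
the decompressed bytes untouched.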

Peter


