[Biopython-dev] [Biopython] SeqIO.index improvement suggestions
Peter
biopython at maubp.freeserve.co.uk
Tue Dec 22 16:08:50 UTC 2009
On Tue, Dec 22, 2009 at 3:34 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> Points to note, this is untested on Windows. In particular we need
> to look at gzipped plain text files using DOS/Windows new lines
> (rare case?) plus gzipped plain text files using Unix new lines
> (likely to be the more common of the two I'd expect). From my
> initial checks, while gzip.open() does take a mode argument it
> doesn't seem to support the "rU" value for universal new line
> read mode. This spoils my plan to give the open_function both
> the filename and the desired mode (generally "rU", but for SFF
> files etc we will want to use "rb").
The gzip mode issue is interesting. Running on Mac OS X 10.5
(Leopard) with the Apple provided Python 2.5.2, a gzipped
QUAL file reads fine in all three modes ("r", "rb" and "rU"):
Python 2.5.2 (r252:60911, Feb 22 2008, 07:57:53)
[GCC 4.0.1 (Apple Computer, Inc. build 5363)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import gzip
>>> gzip.open("Quality/example.qual.gz", "r").read()
'>EAS54_6_R1_2_1_413_324\n26 26 18 26 26 26 26 26 26 26 26 26 26 26 26
22 26 26 26 26\n26 26 26 23 23\n>EAS54_6_R1_2_1_540_792\n26 26 26 26
26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26\n26 18 26 23
18\n>EAS54_6_R1_2_1_443_348\n26 26 26 26 26 26 26 26 26 26 26 24 26 22
26 26 13 22 26 18\n24 18 18 18 18\n'
>>> gzip.open("Quality/example.qual.gz", "rb").read()
'>EAS54_6_R1_2_1_413_324\n26 26 18 26 26 26 26 26 26 26 26 26 26 26 26
22 26 26 26 26\n26 26 26 23 23\n>EAS54_6_R1_2_1_540_792\n26 26 26 26
26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26\n26 18 26 23
18\n>EAS54_6_R1_2_1_443_348\n26 26 26 26 26 26 26 26 26 26 26 24 26 22
26 26 13 22 26 18\n24 18 18 18 18\n'
>>> gzip.open("Quality/example.qual.gz", "rU").read()
'>EAS54_6_R1_2_1_413_324\n26 26 18 26 26 26 26 26 26 26 26 26 26 26 26
22 26 26 26 26\n26 26 26 23 23\n>EAS54_6_R1_2_1_540_792\n26 26 26 26
26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26\n26 18 26 23
18\n>EAS54_6_R1_2_1_443_348\n26 26 26 26 26 26 26 26 26 26 26 24 26 22
26 26 13 22 26 18\n24 18 18 18 18\n'
Looking at a gzipped FASTA file, everything is fine as well:
>>> gzip.open("Quality/example.fasta.gz", "r").read()
'>EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n>EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n>EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n'
>>> gzip.open("Quality/example.fasta.gz", "rb").read()
'>EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n>EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n>EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n'
>>> gzip.open("Quality/example.fasta.gz", "rU").read()
'>EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n>EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n>EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n'
But there is a problem with my gzipped FASTQ file. Modes "r" and
"rb" work, while "rU" fails:
>>> gzip.open("Quality/example.fastq.gz", "r").read()
'@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;393333\n'
>>> gzip.open("Quality/example.fastq.gz", "rb").read()
'@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;393333\n'
>>> gzip.open("Quality/example.fastq.gz", "rU").read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/gzip.py", line 220, in read
    self._read(readsize)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/gzip.py", line 292, in _read
    self._read_eof()
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/gzip.py", line 311, in _read_eof
    raise IOError, "CRC check failed"
IOError: CRC check failed
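Since the "r" and "rb" reads succeed, the file itself is presumably
intact. One way to double check is to compare the gzip trailer against
the decompressed data by hand. In the gzip format (RFC 1952) the last
eight bytes hold the CRC-32 and the length (modulo 2**32) of the
uncompressed data, both little endian. A minimal sketch, assuming a
single-member gzip stream and a current Python 3 (gzip.compress and
gzip.decompress do not exist in the Python 2.5 used above). The helper
names are made up for illustration:

```python
import gzip
import struct
import zlib


def read_trailer(raw_bytes):
    """Return (crc32, size) as stored in the last 8 bytes of a gzip stream."""
    # Both fields are unsigned 32-bit little-endian integers.
    return struct.unpack("<II", raw_bytes[-8:])


def expected_trailer(data):
    """Return the (crc32, size) pair a gzip writer should store for data."""
    return (zlib.crc32(data) & 0xFFFFFFFF, len(data) & 0xFFFFFFFF)


# Small in-memory stand-in for a FASTQ file (not the real example.fastq.gz).
payload = b"@read1\nACGT\n+\n;;;;\n"
raw = gzip.compress(payload)

# Compare what the trailer claims with what the decompressed data gives.
trailer_ok = read_trailer(raw) == expected_trailer(gzip.decompress(raw))
print(trailer_ok)
```

For a file on disk, raw would instead come from
open(filename, "rb").read().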
I may have stumbled on a bug in the Python gzip library :(
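In the meantime, one way to sidestep the problem would be to never
pass "rU" to gzip.open() at all: always decompress in binary mode and
do the new line translation on the decompressed data. A rough sketch,
assuming a current Python 3 where io.TextIOWrapper is available (not
the Python 2.5 above) and assuming ASCII content. The open_gzip name
and its mode handling are hypothetical, not existing Biopython code:

```python
import gzip
import io
import os
import tempfile


def open_gzip(filename, mode="rU"):
    """Hypothetical open_function for gzipped files.

    Any mode containing "b" gives raw decompressed bytes; anything else
    gives text with universal new line translation.
    """
    # Always decompress in binary mode, whatever mode was asked for.
    binary_handle = gzip.open(filename, "rb")
    if "b" in mode:
        return binary_handle
    # newline=None makes the wrapper turn \r\n and \r into \n on reading.
    return io.TextIOWrapper(binary_handle, encoding="ascii", newline=None)


# Demo with a tiny gzipped FASTA file using DOS/Windows new lines.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.fasta.gz")
with gzip.open(path, "wb") as handle:
    handle.write(b">id1\r\nACGT\r\n")

with open_gzip(path, "rU") as handle:
    text = handle.read()  # new lines translated to \n
with open_gzip(path, "rb") as handle:
    raw = handle.read()  # bytes exactly as stored

os.remove(path)
os.rmdir(tmp_dir)
```

This keeps the two-argument (filename, mode) calling convention, so SFF
style binary formats could still ask for "rb".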
Peter