[Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?

Peter biopython at maubp.freeserve.co.uk
Fri Jun 4 09:16:19 UTC 2010


On Fri, Jun 4, 2010 at 9:49 AM, Leighton Pritchard <lpritc at scri.ac.uk> wrote:
> Hi,
>
> On 03/06/2010 Thursday, June 3, 18:52, "Peter"
> <biopython at maubp.freeserve.co.uk> wrote:
>
>> There are two major additions that have been discussed (and some
>> code written too): gzip support and storing the index on disk.
>
> [...]
>
>> Now ideally we'd be able to offer both of these features - but if
>> you had to vote, which would be most important and why?
>
> On-disk indexing.  But does this not also lend itself (perhaps
> eventually...) also to storing the whole dataset in SQLite or similar to
> avoid syncing problems between the file and the index?  Wasn't that also
> part of a discussion on the BIP list some time ago?

That is a much more complicated problem - serialising data from many
different possible file formats. We have BioSQL, which is pretty good
for things like GenBank, EMBL, SwissProt etc., but it is not suitable for
FASTQ. I'd rather stick to the simpler task of recording a lookup table
mapping record identifiers to file offsets.
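
To make that concrete, here is a rough sketch of the kind of thing I mean
(not the actual Bio.SeqIO.index() code - the function names are just made
up for this email). You scan a FASTA file once, recording the offset of
each record, and then seek() straight back to any record on demand:

def index_fasta(filename):
    # One pass over the file, noting where each ">" title line starts.
    offsets = {}
    handle = open(filename, "rU")
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(">"):
            identifier = line[1:].split(None, 1)[0]
            offsets[identifier] = offset
    handle.close()
    return offsets

def get_raw_record(filename, offsets, identifier):
    # Jump straight to the record and read up to the next ">" line.
    handle = open(filename, "rU")
    handle.seek(offsets[identifier])
    lines = [handle.readline()]
    while True:
        line = handle.readline()
        if not line or line.startswith(">"):
            break
        lines.append(line)
    handle.close()
    return "".join(lines)

The real code has to cope with all the other file formats too, but the
index itself is nothing more than that identifier-to-offset mapping.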

> I've not looked at how you're already parsing from gzip files, so I hope
> it's more time-efficient than what I used to do for bzip, which was to write
> a Pyrex wrapper to Flex, using the bzip2 library directly.  This was not a
> speed improvement over uncompressing the file each time I needed to open it
> (and then using Flex).  The same is true for Python's gzip module:
>
> -rw-r--r--  1 lpritc  staff   110M 14 Apr 14:22
> phytophthora_infestans_data.tar.gz
>
> $ time gunzip phytophthora_infestans_data.tar.gz
>
> real    0m18.359s
> user    0m3.562s
> sys    0m0.582s
>
> Python 2.6 (trunk:66714:66715M, Oct  1 2008, 18:36:04)
> [GCC 4.0.1 (Apple Computer, Inc. build 5370)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
>>>> import time
>>>> import gzip
>>>> def gzip_time():
> ...     t0 = time.time()
> ...     f = gzip.open('phytophthora_infestans_data.tar.gz','rb')
> ...     f.read()
> ...     print time.time()-t0
> ...
>>>> gzip_time()
> 19.2009749413
>
> If you know where your data is, it can be quicker to get to, but you still
> need to uncompress each time, and it scales approximately linearly with the
> number of lines returned, as you'd expect:
>
>>>> def read_lines(n):
> ...     t0 = time.time()
> ...     f = gzip.open('phytophthora_infestans_data.tar.gz', 'rb')
> ...     lset = [f.readline() for i in range(n)]
> ...     print time.time() - t0
> ...     return lset
> ...
>>>> d = read_lines(1000)
> 0.0324518680573
>>>> d = read_lines(10000)
> 0.11150097847
>>>> d = read_lines(100000)
> 0.808992147446
>>>> d = read_lines(1000000)
> 7.9017291069
>>>> d = read_lines(2000000)
> 15.7361371517
>>>> d = read_lines(3000000)
> 23.7589659691
>
> The advantage to me was in the amount of disk space (and network transfer
> time/bandwidth) saved by dealing with a compressed file.  In the end I
> decided that, where data access was likely to be frequent, buying more
> storage and handling uncompressed data would be a better option than dealing
> directly with the compressed file:
>
> -rw-r--r--  1 lpritc  staff   410M 14 Apr 14:22
> phytophthora_infestans_data.tar
>
>>>> def read_file():
> ...     t0 = time.time()
> ...     d = open('phytophthora_infestans_data.tar','rb').read()
> ...     print time.time() - t0
> ...
>>>> read_file()
> 0.620229959488
>
>>>> def read_file_lines(n):
> ...     t0 = time.time()
> ...     f = open('phytophthora_infestans_data.1.tar', 'rb')
> ...     lset = [f.readline() for i in range(n)]
> ...     print time.time() - t0
> ...     return lset
> ...
>>>> d = read_file_lines(100)
> 0.000148057937622
>>>> d = read_file_lines(1000)
> 0.000863075256348
>>>> d = read_file_lines(10000)
> 0.00704002380371
>>>> d = read_file_lines(100000)
> 0.0780401229858
>>>> d = read_file_lines(1000000)
> 0.804203033447
>>>> d = read_file_lines(2000000)
> 1.71462202072
>>>> d = read_file_lines(4000000)
> 3.55472993851
>
>
> I don't see (though I'm happy to be shown) how you can efficiently index
> directly into LZW/DEFLATE/BZIP compressed data.  If you're not decompressing
> the whole thing in one go, I think you still have to partially decompress a
> section of the file (starting from the front of the file) to retrieve your
> sequence each time.  Even if you do index - say, by recording the required
> buffer size/number of buffer decompressions and the offset of your sequence
> in the decompressed output - this only saves memory if you discard, rather
> than cache, the unwanted early output.  I'm not sure it would be
> time-efficient for more than one or two (on average) sequences per
> compressed file.  You'd likely be better off waiting for the file to
> decompress once and doing science with the time that's left over ;)
>
> I could be wrong, though...
>

The proof-of-concept support for gzip files in Bio.SeqIO.index() just called
the Python gzip module. This gives us a file-like handle object supporting
the usual methods like readline and iteration (used to scan the file looking
for each record) and seek/tell (where the offsets refer to the decompressed
stream). Building the index must, by its nature, decompress the whole file
once - there is no way round that. The interesting thing is how seeking to
an offset and then reading a record performs - I have not looked at the run
time or memory usage for this yet. It works, but your measurements do
suggest it will be much, much slower than using the original file.
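
If anyone wants to measure it, something along these lines should show the
cost (the filename is just a placeholder) - seeking to an offset in a gzip
handle forces the gzip module to decompress everything up to that point:

import time
import gzip

def time_gzip_seek(filename, offset):
    t0 = time.time()
    handle = gzip.open(filename, "rb")
    handle.seek(offset)    # gzip has to decompress everything up to here
    line = handle.readline()
    handle.close()
    print time.time() - t0
    return line

I'd expect the time to grow roughly linearly with the offset, just like the
readline numbers above.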

In other words, it looks like while the code to support gzip files in
Bio.SeqIO.index() is quite short, the performance may be unimpressive for
large archives. I doubt this can be worked around - it's the cost of saving
disk space by compressing the whole file without taking any special care to
put different records into different blocks.
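
For what it's worth, the "different blocks" idea could in principle be done
with plain gzip, since a gzip file can hold several concatenated members.
Something like this untested sketch (nothing to do with the current code,
and the names are made up) writes each record as its own gzip member, with
the index storing the raw file offset plus the decompressed size:

import gzip

def write_blocked_gzip(records, filename):
    # records is an iterable of (identifier, text) pairs.
    offsets = {}
    handle = open(filename, "wb")
    for identifier, text in records:
        offsets[identifier] = (handle.tell(), len(text))
        member = gzip.GzipFile(fileobj=handle, mode="wb")
        member.write(text)
        member.close()    # finishes one complete gzip member
    handle.close()
    return offsets

def fetch_record(filename, offsets, identifier):
    start, size = offsets[identifier]
    handle = open(filename, "rb")
    handle.seek(start)    # jump to the start of that member
    data = gzip.GzipFile(fileobj=handle, mode="rb").read(size)
    handle.close()
    return data

Compressing each record separately would of course lose some of the space
saving, which is presumably why you'd really want many records per block.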

Peter

[As an aside, this is something I'm interested in for BAM file support -
these are binary files which are gzip compressed.]



