[Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?

Leighton Pritchard lpritc at scri.ac.uk
Fri Jun 4 08:49:06 UTC 2010


Hi,

On Thursday, June 3, 2010 at 18:52, "Peter"
<biopython at maubp.freeserve.co.uk> wrote:

> There are two major additions that have been discussed (and some
> code written too): gzip support and storing the index on disk.

[...]

> Now ideally we'd be able to offer both of these features - but if
> you had to vote, which would be most important and why?

On-disk indexing.  But does this not also lend itself (perhaps
eventually...) to storing the whole dataset in SQLite or similar, to
avoid syncing problems between the file and the index?  Wasn't that
also part of a discussion on the BIP list some time ago?
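
(For what it's worth, here's a rough sketch of the kind of thing I
have in mind - a single-table SQLite index mapping record identifiers
to byte offsets in the sequence file; the file and table names are
just made up for illustration:)

import sqlite3

conn = sqlite3.connect('seq_index.sqlite')  # index lives on disk, not in memory
conn.execute('CREATE TABLE IF NOT EXISTS offsets '
             '(key TEXT PRIMARY KEY, offset INTEGER)')
# Populate with (identifier, byte offset) pairs during a single scan
# of the sequence file, e.g.:
conn.execute('INSERT OR REPLACE INTO offsets VALUES (?, ?)', ('record1', 0))
conn.commit()

# A lookup is then one query plus one seek into the sequence file:
row = conn.execute('SELECT offset FROM offsets WHERE key = ?',
                   ('record1',)).fetchone()
print(row[0])

Storing the sequences themselves in the database as well would be the
obvious next step, and would remove the file/index syncing problem
entirely.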


I've not looked at how you're already parsing from gzip files, but I
hope it's more time-efficient than what I used to do for bzip2, which
was to write a Pyrex wrapper around Flex, using the bzip2 library
directly.  That was no faster than uncompressing the file each time I
needed to open it (and then using Flex).  The same is true of Python's
gzip module:

-rw-r--r--  1 lpritc  staff   110M 14 Apr 14:22 phytophthora_infestans_data.tar.gz

$ time gunzip phytophthora_infestans_data.tar.gz

real    0m18.359s
user    0m3.562s
sys    0m0.582s

Python 2.6 (trunk:66714:66715M, Oct  1 2008, 18:36:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5370)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import time
>>> import gzip
>>> def gzip_time():
...     t0 = time.time()
...     f = gzip.open('phytophthora_infestans_data.tar.gz','rb')
...     f.read()
...     print time.time()-t0
... 
>>> gzip_time()
19.2009749413

If you know where your data is in the file, it can be quicker to get
to, but you still have to decompress each time, and the cost scales
approximately linearly with the number of lines returned, as you'd
expect:

>>> def read_lines(n):
...     t0 = time.time()
...     f = gzip.open('phytophthora_infestans_data.tar.gz', 'rb')
...     lset = [f.readline() for i in range(n)]
...     print time.time() - t0
...     return lset
...
>>> d = read_lines(1000)
0.0324518680573
>>> d = read_lines(10000)
0.11150097847
>>> d = read_lines(100000)
0.808992147446
>>> d = read_lines(1000000)
7.9017291069
>>> d = read_lines(2000000)
15.7361371517
>>> d = read_lines(3000000)
23.7589659691

The advantage to me was in the amount of disk space (and network transfer
time/bandwidth) saved by dealing with a compressed file.  In the end I
decided that, where data access was likely to be frequent, buying more
storage and handling uncompressed data would be a better option than dealing
directly with the compressed file:

-rw-r--r--  1 lpritc  staff   410M 14 Apr 14:22 phytophthora_infestans_data.tar

>>> def read_file():
...     t0 = time.time()
...     d = open('phytophthora_infestans_data.tar','rb').read()
...     print time.time() - t0
... 
>>> read_file()
0.620229959488

>>> def read_file_lines(n):
...     t0 = time.time()
...     f = open('phytophthora_infestans_data.1.tar', 'rb')
...     lset = [f.readline() for i in range(n)]
...     print time.time() - t0
...     return lset
... 
>>> d = read_file_lines(100)
0.000148057937622
>>> d = read_file_lines(1000)
0.000863075256348
>>> d = read_file_lines(10000)
0.00704002380371
>>> d = read_file_lines(100000)
0.0780401229858
>>> d = read_file_lines(1000000)
0.804203033447
>>> d = read_file_lines(2000000)
1.71462202072
>>> d = read_file_lines(4000000)
3.55472993851
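
(The flip side of working uncompressed is that a recorded byte offset
gets you straight to any record with a single seek, independent of
file size - which is where the on-disk index earns its keep.  A rough
sketch, with a made-up FASTA filename and record identifier:)

offsets = {}
f = open('my_records.fasta', 'rb')
while True:
    pos = f.tell()              # byte offset before reading the line
    line = f.readline()
    if not line:
        break
    if line.startswith(b'>'):   # record header: remember where it starts
        offsets[line[1:].split()[0]] = pos
f.close()

# Later, jump straight to any record - no decompression, no scanning:
f = open('my_records.fasta', 'rb')
f.seek(offsets[b'some_id'])
print(f.readline())
f.close()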


I don't see (though I'm happy to be shown) how you can efficiently
index directly into LZW/DEFLATE/bzip2-compressed data.  If you're not
decompressing the whole thing in one go, I think you still have to
partially decompress a section of the file (starting from the front)
to retrieve your sequence each time, even if you index it - say, by
recording the number of buffer-sized decompressions required and the
offset of your sequence within the decompressed output.  This could
save memory if you discard, rather than cache, the unwanted early
output, but I'm not sure it would be time-efficient for more than one
or two (on average) retrievals per compressed file.  You'd likely be
better off waiting for the file to decompress once, and doing science
with the time that's left over ;)
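
(Something like this rough sketch is what I mean by discarding as you
go - decompress from the front in fixed-size chunks, throw away output
until you reach a recorded offset in the decompressed stream, and keep
only what you want.  The function and its parameters are mine, not
anything in Biopython:)

import zlib

def read_at_offset(path, offset, length, chunk=64 * 1024):
    # 16 + MAX_WBITS tells zlib to expect a gzip header
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    seen = 0      # decompressed bytes discarded so far
    out = b''
    f = open(path, 'rb')
    while True:
        block = f.read(chunk)
        if not block:
            break
        data = d.decompress(block)
        if seen + len(data) > offset:
            # we've reached the region of interest; keep the tail
            out += data[max(0, offset - seen):]
            if len(out) >= length:
                break
        seen += len(data)
    f.close()
    return out[:length]

Memory stays bounded at roughly the chunk size, but time is still
proportional to the offset - you pay for everything in front of your
record, every time.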

I could be wrong, though...

Cheers,

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405

