[Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?

Laurent lgautier at gmail.com
Fri Jun 4 18:25:42 UTC 2010


On 04/06/10 18:00, biopython-request at lists.open-bio.org wrote:
>
> On Fri, Jun 4, 2010 at 11:53 AM, Kevin<aboulia at gmail.com>  wrote:
>> I vote for sqlite index. Have been using bsddb to do the same but the db
>> is inflated compared to plain text. Performance is not bad using btree
>
> The other major point against bsddb is that future versions of Python
> will not include it in the standard library - but Python 2.5+ does have
> sqlite3 included.
>
>> For gzip I feel it might be possible to gunzip into a stream which
>> biopython can parse on the fly?
>
> Yes of course, like this:
>
> import gzip
> from Bio import SeqIO
> handle = gzip.open("uniprot_sprot.dat.gz")
> for record in SeqIO.parse(handle, "swiss"): print record.id
> handle.close()
>
> Parsing is easy - the point of this discussion is random access to
> any record within the stream (which requires jumping to an offset).
>
> Peter
>
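The sqlite-backed index Kevin votes for can be sketched with nothing but the standard library: scan the file once, store (identifier, byte offset) pairs in an sqlite3 table, then seek to the stored offset for random access. This is only an illustration, not Biopython's actual implementation; the file contents, table schema, and helper names are made up, and real code would handle malformed input.

```python
import os
import sqlite3
import tempfile

# Tiny FASTA file to index (hypothetical demo data).
fasta = b">seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n"
path = os.path.join(tempfile.mkdtemp(), "demo.fasta")
with open(path, "wb") as out:
    out.write(fasta)

# Build an index of record identifier -> byte offset. An in-memory
# database is used here; a filename would put the index on disk.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE offsets (id TEXT PRIMARY KEY, offset INTEGER)")
with open(path, "rb") as handle:
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            rec_id = line[1:].split()[0].decode()
            db.execute("INSERT INTO offsets VALUES (?, ?)", (rec_id, offset))
db.commit()

def fetch(rec_id):
    """Random access: look up the offset, seek, read one record."""
    (offset,) = db.execute(
        "SELECT offset FROM offsets WHERE id = ?", (rec_id,)).fetchone()
    with open(path, "rb") as handle:
        handle.seek(offset)
        lines = [handle.readline()]          # header line of this record
        while True:
            line = handle.readline()
            if not line or line.startswith(b">"):
                break                        # EOF or start of next record
            lines.append(line)
        return b"".join(lines)

print(fetch("seq2"))  # b'>seq2\nGGCC\n'
```

Only the small index lives in the database; the sequence data stays in the original flat file, which avoids the inflation seen with bsddb.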


One note of caution: Python's gzip module is slow, or so I have found... to 
the point that I ended up wrapping the code in a function that gunzipped 
the file to a temporary location, parsed it and extracted the information, 
then deleted the temporary file.

Regarding random access in compressed files, there is the BGZF format, but 
I am not familiar enough with it to say whether it would be of use here.

More generally, compression is built into the HDF5 format, and this, 
combined with chunking, could prove the most battle-tested way to access 
entries randomly.
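The chunk idea shared by BGZF and HDF5 can be illustrated with the stdlib alone: compress records in independent blocks and keep an index of which block each record lives in, so a lookup only decompresses one block rather than the whole file. A minimal sketch with made-up data and an arbitrary two-records-per-chunk policy:

```python
import gzip
import io

# Hypothetical demo records: id -> sequence bytes.
records = {"seq%d" % i: b"ACGT" * i for i in range(1, 7)}

blocks = []        # independently compressed chunks
index = {}         # record id -> (block number, offset in block, length)
buf, block_no = io.BytesIO(), 0
for n, (rec_id, seq) in enumerate(sorted(records.items()), 1):
    index[rec_id] = (block_no, buf.tell(), len(seq))
    buf.write(seq)
    if n % 2 == 0:                       # close the chunk every 2 records
        blocks.append(gzip.compress(buf.getvalue()))
        buf, block_no = io.BytesIO(), block_no + 1

def fetch(rec_id):
    """Inflate only the one chunk holding the record, then slice it out."""
    block, offset, length = index[rec_id]
    data = gzip.decompress(blocks[block])
    return data[offset:offset + length]

print(fetch("seq3"))  # b'ACGTACGTACGT'
```

BGZF fixes the chunk boundaries at the gzip level (virtual file offsets encode block and within-block position), while HDF5 manages chunking and compression per dataset; the access pattern is the same.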



L.
