[Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?

Fri Jun 4 10:53:14 UTC 2010

I vote for the SQLite index. I have been using bsddb to do the same
thing, but the db is inflated compared to the plain text file.
Performance is not bad using a btree.
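
As a rough sketch of the on-disk index idea (standard library only, and
not how Biopython does or will do it): scan the FASTA file once, store
each record's byte offset in an SQLite table, then seek straight to a
record later. The table layout and the names build_index and fetch_raw
are made up for illustration.

```python
import sqlite3


def build_index(fasta_path, db_path):
    """Scan a FASTA file once and store each record's byte offset in SQLite."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS offsets "
        "(id TEXT PRIMARY KEY, file_offset INTEGER)"
    )
    # Binary mode keeps tell()/seek() positions exact byte offsets.
    with open(fasta_path, "rb") as handle:
        while True:
            file_offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # Use the first word after ">" as the record ID.
                fields = line[1:].split(None, 1)
                record_id = fields[0].decode() if fields else ""
                con.execute(
                    "INSERT OR REPLACE INTO offsets VALUES (?, ?)",
                    (record_id, file_offset),
                )
    con.commit()
    return con


def fetch_raw(fasta_path, con, record_id):
    """Seek to one record via the stored offset and return its raw text."""
    row = con.execute(
        "SELECT file_offset FROM offsets WHERE id = ?", (record_id,)
    ).fetchone()
    if row is None:
        raise KeyError(record_id)
    lines = []
    with open(fasta_path, "rb") as handle:
        handle.seek(row[0])
        lines.append(handle.readline())  # the ">" header line
        for line in handle:
            if line.startswith(b">"):  # next record starts here
                break
            lines.append(line)
    return b"".join(lines).decode()
```

On a later run the script would only need sqlite3.connect() on the
saved db file, not another scan, and the offsets never have to sit in
memory all at once.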

For gzip, I feel it might be possible to gunzip into a stream which
Biopython can parse on the fly?
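
Something along these lines, as a minimal sketch: gzip.open() in text
mode decompresses lazily, so records can be read one at a time without
unpacking the whole file first. The tiny iter_fasta parser below is
only a hypothetical stand-in to keep the example self-contained; with
Biopython installed, the same handle should be usable with
Bio.SeqIO.parse(handle, "fasta") instead.

```python
import gzip


def iter_fasta(handle):
    """Yield (id, sequence) pairs from a text handle, one record at a time."""
    record_id, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            # Use the first word after ">" as the record ID.
            fields = line[1:].split(None, 1)
            record_id, chunks = (fields[0] if fields else ""), []
        elif record_id is not None:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


def iter_gzipped_fasta(path):
    """Stream records out of a gzipped FASTA file."""
    # "rt" decompresses and decodes to text as lines are requested.
    with gzip.open(path, "rt") as handle:
        yield from iter_fasta(handle)
```

This only covers sequential parsing. Random access by offset into the
compressed file is a harder problem and is not addressed here.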

Kev

On 04-Jun-2010, at 1:52 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> Dear Biopythoneers,
>
> We've had several discussions (mostly on the development list) about
> extending the Bio.SeqIO.index() functionality. For a quick recap of what
> you can do with it right now, see:
>
> http://news.open-bio.org/news/2009/09/biopython-seqio-index/
> http://news.open-bio.org/news/2010/04/partial-seq-files-biopython/
>
> There are two major additions that have been discussed (and some
> code written too): gzip support and storing the index on disk.
>
> Currently Bio.SeqIO.index() has to re-index the sequence file each
> time you run your script. If you run the same script often, it would be
> useful to be able to save the index information to disk. The idea is
> that you can then load the index file and get almost immediate
> random access to the associated sequence file (without waiting to scan
> the file to rebuild the index). The old OBDA-style indexes used by
> BioPerl, BioRuby, etc. are one possible file format we might use, but
> a simple SQLite database may be preferable. This would also give
> us a way to index really big files with many millions of reads without
> keeping the file offsets in memory. This is going to be important for
> random access to the latest massive sequencing data files.
>
> Next, support for indexing compressed files (initially concentrating
> on Unix style gzipped files, e.g. example.fasta.gz) without having
> to decompress the whole file. You can already parse these files
> with Bio.SeqIO in conjunction with the Python gzip module. It would
> be nice to be able to index them too.
>
> Now ideally we'd be able to offer both of these features - but if
> you had to vote, which would be most important and why?
>
> Peter