[Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?
Renato Alves
rjalves at igc.gulbenkian.pt
Fri Jun 4 04:10:22 EDT 2010
Hi Peter and all,
Since the first addition addresses what 'is' a potential problem,
while the second is more of an optimization, I would put my vote on the
first. In addition, an SQLite or similar solution would also allow one
to use the indexing feature in short-running applications, where
recalculating the index on every run is a costly (sometimes
prohibitively so) operation.
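As a rough illustration of the on-disk idea (a sketch only, not Biopython
code): the function names build_offset_index and fetch_record, the table
name "offsets", and the plain FASTA layout are all assumptions made for
this example. It uses only the standard library sqlite3 module and
assumes Python 3.

```python
import os
import sqlite3
import tempfile


def build_offset_index(fasta_path, index_path):
    """Scan a FASTA file once and store each record's byte offset in SQLite."""
    con = sqlite3.connect(index_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS offsets (key TEXT PRIMARY KEY, offset INTEGER)"
    )
    with open(fasta_path, "rb") as handle:
        while True:
            offset = handle.tell()  # byte position where this line starts
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # Hypothetical key choice: first word after the ">" marker.
                fields = line[1:].split(None, 1)
                key = fields[0].decode() if fields else ""
                con.execute(
                    "INSERT OR REPLACE INTO offsets VALUES (?, ?)", (key, offset)
                )
    con.commit()
    con.close()


def fetch_record(fasta_path, index_path, key):
    """Look the offset up on disk, seek to it, and return the raw record text."""
    con = sqlite3.connect(index_path)
    row = con.execute(
        "SELECT offset FROM offsets WHERE key = ?", (key,)
    ).fetchone()
    con.close()
    if row is None:
        raise KeyError(key)
    lines = []
    with open(fasta_path, "rb") as handle:
        handle.seek(row[0])
        lines.append(handle.readline())  # the ">" header line
        for line in handle:
            if line.startswith(b">"):  # next record starts here
                break
            lines.append(line)
    return b"".join(lines).decode()


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    fasta = os.path.join(workdir, "example.fasta")
    index = os.path.join(workdir, "example.idx")
    with open(fasta, "wb") as out:
        out.write(b">seq1 first\nACGT\n>seq2 second\nGGCC\nTTAA\n")
    build_offset_index(fasta, index)  # done once, then reused across runs
    print(fetch_record(fasta, index, "seq2"))
```

A short-running script would call build_offset_index only when the index
file is missing, and otherwise go straight to fetch_record, so the
offsets never have to be held in memory or recomputed.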
Obviously the second would be of great use if put together with the
first, but I'm a little biased on that, since I was part of the group
that raised the gzip question on the mailing list some time ago.
Regards,
Renato
Quoting Peter on 06/03/2010 06:52 PM:
> Dear Biopythoneers,
>
> We've had several discussions (mostly on the development list) about
> extending the Bio.SeqIO.index() functionality. For a quick recap of what
> you can do with it right now, see:
>
> http://news.open-bio.org/news/2009/09/biopython-seqio-index/
> http://news.open-bio.org/news/2010/04/partial-seq-files-biopython/
>
> There are two major additions that have been discussed (and some
> code written too): gzip support and storing the index on disk.
>
> Currently Bio.SeqIO.index() has to re-index the sequence file each
> time you run your script. If you run the same script often, it would be
> useful to be able to save the index information to disk. The idea is
> that you can then load the index file and get almost immediate
> random access to the associated sequence file (without waiting to scan
> the file to rebuild the index). The old OBDA style indexes used by
> BioPerl, BioRuby etc are one possible file format we might use, but
> a simple SQLite database may be preferable. This also would give
> us a way to index really big files with many millions of reads without
> keeping the file offsets in memory. This is going to be important for
> random access to the latest massive sequencing data files.
>
> Next, support for indexing compressed files (initially concentrating
> on Unix style gzipped files, e.g. example.fasta.gz) without having
> to decompress the whole file. You can already parse these files
> with Bio.SeqIO in conjunction with the Python gzip module. It would
> be nice to be able to index them too.
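For reference, sequential parsing of a gzipped file can be sketched as
below. This is a standard-library-only illustration assuming Python 3;
iter_fasta and read_gzipped_fasta are hypothetical helper names, and the
tiny parser stands in for Bio.SeqIO.parse, which could be given the same
text-mode handle as Bio.SeqIO.parse(handle, "fasta").

```python
import gzip
import os
import tempfile


def iter_fasta(handle):
    """Yield (identifier, sequence) pairs from a text-mode FASTA handle."""
    name, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split(None, 1)
            name = fields[0] if fields else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def read_gzipped_fasta(path):
    """Stream-decompress a .gz FASTA file and return all records as a list."""
    # "rt" gives a text-mode handle that decompresses as it is read.
    with gzip.open(path, "rt") as handle:
        return list(iter_fasta(handle))


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "example.fasta.gz")
    with gzip.open(path, "wt") as out:
        out.write(">seq1 first\nACGT\n>seq2 second\nGGCC\nTTAA\n")
    for name, seq in read_gzipped_fasta(path):
        print(name, seq)
```

This reads the file from start to finish. Random access is the harder
part: the gzip module's seek() works by decompressing from the start of
the stream, so a stored offset does not give a cheap jump the way it
does for an uncompressed file.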
>
> Now ideally we'd be able to offer both of these features - but if
> you had to vote, which would be most important and why?
>
> Peter