[Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?

Peter biopython at maubp.freeserve.co.uk
Mon Jun 7 11:02:12 UTC 2010


On Mon, Jun 7, 2010 at 11:38 AM, Kevin Lam <aboulia at gmail.com> wrote:
>
> Hi Peter
> Can I summarise it? I think a lot of well-meaning people are pushing for their
> favourite db, but using a disk-based db like sqlite for Bio.SeqIO.index() to
> record the file offsets is going to be the best way to do it, versus trying
> to find another suitable non-mysql db variant to 'databasify' the short
> read data. The latter would be relatively easy for anyone else
> interested to experiment with by coding their own scripts for their favourite db.
>
> :)

I think that is a good summary.

Bio.SeqIO.index() is for random access to assorted existing file formats
(e.g. FASTA, FASTQ, SFF) by record identifier string, and works with a
lookup table of offsets. We are going to try storing this lookup table in
SQLite. The proof of concept code works, is cross platform, adds no
external dependencies - and seems fast enough too.
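
To illustrate the general idea only (this is not the proof of concept code itself), here is a minimal sketch of a lookup table of offsets kept in SQLite for a FASTA file. It uses just the standard library sqlite3 module. The function names build_offset_index and fetch_record, and the table and column names, are made up for this example and are not part of Biopython.

```python
import sqlite3


def build_offset_index(fasta_path, db_path):
    """Store the byte offset of each FASTA record's '>' line in SQLite.

    Hypothetical helper for illustration; not Biopython's implementation.
    """
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS offsets "
        "(id TEXT PRIMARY KEY, file_offset INTEGER)"
    )
    # Binary mode so that tell() gives true byte offsets on every platform.
    with open(fasta_path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # Assume the identifier is the first word after '>'.
                fields = line[1:].split()
                record_id = fields[0].decode() if fields else ""
                con.execute(
                    "INSERT OR REPLACE INTO offsets VALUES (?, ?)",
                    (record_id, offset),
                )
    con.commit()
    return con


def fetch_record(fasta_path, con, record_id):
    """Seek straight to one record and return its raw text."""
    row = con.execute(
        "SELECT file_offset FROM offsets WHERE id = ?", (record_id,)
    ).fetchone()
    if row is None:
        raise KeyError(record_id)
    with open(fasta_path, "rb") as handle:
        handle.seek(row[0])
        lines = [handle.readline()]  # the '>' title line
        for line in iter(handle.readline, b""):
            if line.startswith(b">"):
                break  # start of the next record
            lines.append(line)
    return b"".join(lines).decode()
```

Passing a filename as db_path keeps the index on disk so it can be reused between runs, while ":memory:" gives a throwaway index. Only the offsets go into the database; the sequence data stays in the original file.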

As we add more file formats to Bio.SeqIO, in most cases we can add
support for indexing them in the same way. Maybe one day this will
include BioHDF as it matures?

Peter



