[Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?

Brad Chapman chapmanb at 50mail.com
Fri Jun 4 19:33:58 UTC 2010


Peter and all;

> > One note of caution: Python's gzip module is slow, or so I experienced... to
> > the point that I ended up wrapping the code into a function that gunzipped
> > the file to a temporary location, parse and extract information, then delete
> > the temporary file.

More generally, I find that keeping files gzipped while doing analysis
is not very helpful. The time spent gunzipping them and feeding them
into programs isn't worth the disk space saved. My only real use of
gzip is for archiving something that I'm done with.
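
The gunzip-to-a-temporary-file workaround quoted above could look
roughly like this. It is only a sketch using the standard library, and
the name gunzipped_copy is made up for illustration; it is not anything
in Biopython.

```python
import gzip
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def gunzipped_copy(gz_path):
    """Yield the path of a temporary, uncompressed copy of gz_path.

    The temporary file is deleted when the with-block exits.
    (Hypothetical helper, named here for illustration only.)
    """
    fd, tmp_path = tempfile.mkstemp(suffix=".tmp")
    try:
        # Wrap the descriptor first so it is always closed, even if
        # opening the gzip file fails.
        with os.fdopen(fd, "wb") as dst, gzip.open(gz_path, "rb") as src:
            shutil.copyfileobj(src, dst)  # stream in chunks, not all in memory
        yield tmp_path
    finally:
        os.remove(tmp_path)
```

Inside the with-block the temporary path can be handed to whatever
parser or external program needs a plain file, for example
"with gunzipped_copy(path) as plain: ...".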

> > Regarding random access in compressed file, there is the BGZF format but I
> > am not familiar enough with it to tell whether it can be of use here.
>
> I've been looking at that this afternoon as it is used in BAM files. 

What Broad does internally is store Fastq files in BAM format. You
can convert with this Picard tool:

http://picard.sourceforge.net/command-line-overview.shtml#FastqToSam

When I first used their tools I expected this to be as annoying as
gzipped files, but in practice it is pretty nice since you can read
the files with pysam. The compressed size is the same as if gzipped.
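
Reading such an unaligned BAM file with pysam might look like the
sketch below. It assumes a recent pysam, where the class is called
AlignmentFile (older releases named it Samfile); the function name
iter_unaligned_bam is made up for illustration.

```python
def iter_unaligned_bam(bam_path):
    """Yield (name, sequence, qualities) for each read in an unaligned BAM.

    Hypothetical helper; assumes the third-party pysam package is installed.
    """
    import pysam  # imported lazily so this sketch loads without pysam

    # check_sq=False: an unaligned BAM has no reference sequences in
    # its header, which pysam would otherwise complain about.
    with pysam.AlignmentFile(bam_path, "rb", check_sq=False) as bam:
        # until_eof=True: read straight through the file; no index or
        # genomic coordinates are needed.
        for read in bam.fetch(until_eof=True):
            yield read.query_name, read.query_sequence, read.query_qualities
```

This only gives sequential access. Nothing here looks reads up by
name.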

What do you think about co-opting the SAM/BAM format for this? This
would limit it to data that can go into BAM (so no GenBank and
whatnot), but would have the advantage of working with existing
workflows.

Region based indexing is already implemented for BAM, but it would
be really useful to also have ID based retrieval along the lines of
what you are proposing.
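
To make the idea of ID based retrieval concrete, here is a minimal
sketch on a plain, uncompressed FASTQ file: scan once, remember the
byte offset of each record, then seek straight to a record by its ID.
It assumes strict four-line records, the function names are made up
for illustration, and it is not how Bio.SeqIO.index() or BAM indexing
is actually implemented. A compressed container would need a
different notion of offset.

```python
def build_offset_index(path):
    """Map each record ID to the byte offset where its record starts.

    Assumes four-line FASTQ records (no wrapped sequence lines).
    """
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            header = handle.readline()
            if not header:
                break  # end of file
            fields = header[1:].split(None, 1)
            if not header.startswith(b"@") or not fields:
                raise ValueError("bad FASTQ header at offset %d" % offset)
            # The ID is the first whitespace-delimited token after '@'.
            index[fields[0].decode("ascii")] = offset
            # Skip the sequence, '+' and quality lines.
            for _ in range(3):
                handle.readline()
    return index


def fetch_record(path, index, record_id):
    """Return (title, sequence, quality) for one record via its offset."""
    with open(path, "rb") as handle:
        handle.seek(index[record_id])
        lines = [handle.readline().decode("ascii").rstrip("\r\n")
                 for _ in range(4)]
    return lines[0][1:], lines[1], lines[3]
```

The dictionary lives in memory here; storing it on disk would be a
separate step.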

Brad


