[Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?

Fri Jun 4 20:21:33 UTC 2010

On Fri, Jun 4, 2010 at 8:33 PM, Brad Chapman wrote:
> Peter and all;
>
> More generally, I find having files gzipped while doing analysis is
> not very helpful. The time to gunzip and feed them into programs
> doesn't end up being worth the space tradeoff. My only real use of
> gzip is when archiving something that I'm done with.

It seems that in general support for random access to gzipped
files is of niche interest. Avoiding this in Bio.SeqIO.index() will
keep the API simple and I think will make the caching to disk
stuff a bit easier too.

>> > Regarding random access in compressed file, there is the BGZF
>> > format but I am not familiar enough with it to tell whether it can be
>> > of use here.
>>
>> I've been looking at that this afternoon as it is used in BAM files.
>
> What Broad does internally is store Fastq files in BAM format. You
> can convert with this Picard tool:
>
> http://picard.sourceforge.net/command-line-overview.shtml#FastqToSam

Thanks for the link - I knew I'd seen a FASTQ to unaligned SAM/BAM
tool out there somewhere.

> Originally when using their tools I thought this would be as annoying as
> gzipped files, but it is practically pretty nice since you can access
> them with pysam. Compression size is the same as if gzipped.

BAM files are a compressed with a variant of gzip (this BGZF
sub-format), so that isn't a big surprise ;)

> What do you think about co-opting the SAM/BAM format for this? This
> would make it more specific for things that can go into BAM (so no
> GenBank and what not), but would have the advantage of working with
> existing workflows.

I can see storing unmapped reads in BAM as a sensible alternative
to FASTQ. Note that you lose any descriptions (not usually important)
but more importantly BAM files do not store the sequence case
information (which is often used to encode trimming points).

Obviously we'd want to have SAM/BAM output support in Bio.SeqIO
to fully take advantage of this (grin). I'm keeping this in mind while
working on SAM/BAM parsing, but it would be a *lot* more work.

> Region based indexing is already implemented for BAM, but it would
> be really useful to also have ID based retrieval along the lines of
> what you are proposing.
>
> Brad

Yeah, I've been reading up on the BAM index format (BAI files) and
they don't do anything about read lookup by ID at all. So I haven't been
reinventing the wheel by trying to do Bio.SeqIO.index() support of
BAM - it should be complementary to the pysam stuff.

Anyway, even for BAM files we should be able to use the same
scheme as all the other file formats supported in Bio.SeqIO.index(),
use an SQLite database to hold the lookup table of read names to
file offsets (rather than a Python dictionary in memory as now).

Regards,

Peter