[Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?

Kevin Lam aboulia at gmail.com
Fri Jun 4 18:35:05 UTC 2010


>
>
> Parsing is easy - the point of this discussion is random access to
> any record within the stream (which requires jumping to an offset).
>
> Peter
>
Apologies, I didn't follow the thread closely enough. Now I understand why
the two might be overlapping.

I would still vote for sqlite3.
Based on my short experience with next-gen sequencing, there are these
other benefits:

1) Pairing csfasta with qual files by read name is easier, and both can be
stored in the same db.
2) Pairing mate pair and paired end reads is easier, and both can be stored
in the same db.
3) Generating fastq files from 1) is easier.
4) Double encoded fasta sequence and base space sequence can be stored in
the same db as well.
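
As a rough sketch of points 1) and 3), assuming Python's standard sqlite3
module: the table layout and the function names (build_read_db, fetch_read,
to_fastq) are made up for illustration and are not Biopython API. It also
assumes qualities arrive as space-separated integers, as in a .qual file,
and it ignores any tool-specific color space FASTQ conventions.

```python
import sqlite3


def build_read_db(csfasta_records, qual_records, path=":memory:"):
    """Store sequences and qualities in one table keyed by read name.

    Both arguments are iterables of (name, value) tuples; parsing the
    actual csfasta/qual files is left out of this sketch.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reads ("
        "name TEXT PRIMARY KEY, seq TEXT, qual TEXT)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO reads (name, seq) VALUES (?, ?)",
        csfasta_records,
    )
    # Attach each quality string to the row with the same read name.
    conn.executemany(
        "UPDATE reads SET qual = ? WHERE name = ?",
        ((qual, name) for name, qual in qual_records),
    )
    conn.commit()
    return conn


def fetch_read(conn, name):
    """Look up one read by name; returns None if the name is unknown."""
    return conn.execute(
        "SELECT name, seq, qual FROM reads WHERE name = ?", (name,)
    ).fetchone()


def to_fastq(name, seq, qual):
    """Format one row as a FASTQ record using Phred+33 encoding.

    Assumption: negative scores (used for missing calls) are clamped to 0.
    """
    scores = [max(int(q), 0) for q in qual.split()]
    encoded = "".join(chr(33 + s) for s in scores)
    return f"@{name}\n{seq}\n+\n{encoded}\n"


# The leading primer base in the csfasta sequence has no quality value.
conn = build_read_db([("read_1", "T0120")], [("read_1", "30 30 20 10")])
print(to_fastq(*fetch_read(conn, "read_1")), end="")
```

Because the read name is the primary key, the lookup goes through sqlite's
own index, and a mate or pair column could be added to the same table for
point 2).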

I think the BWT method of indexing and compression used in bowtie and bwa
for reference genomes might be a better way of going about the problem. That
said, I think disk space is seldom an issue now that storage costs keep
falling. Time and convenience are probably more important. The one time I
wished for smaller NGS files was when I needed to do transfers.
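
For anyone unfamiliar with the BWT, here is a naive sketch of the transform
and its inverse. It is only meant to show the idea: it sorts every rotation
in memory, which would not scale to a genome, and real aligners build the
transform very differently. The function names and the "$" sentinel are my
own choices for illustration.

```python
def bwt(text, sentinel="$"):
    """Return the Burrows-Wheeler transform of text (naive version)."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    # Sort all rotations of the string and keep the last column.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, sentinel="$"):
    """Rebuild the original text from its transform (naive version)."""
    table = [""] * len(last_column)
    # Repeatedly prepend the last column and re-sort; after len() rounds
    # the table holds every rotation in sorted order.
    for _ in range(len(last_column)):
        table = sorted(c + row for c, row in zip(last_column, table))
    for row in table:
        if row.endswith(sentinel):
            return row[:-1]
    raise ValueError("sentinel not found in input")


print(bwt("banana"))                  # annb$aa
print(inverse_bwt(bwt("GATTACA")))    # GATTACA
```

The transform tends to group identical characters into runs, which is what
makes the output compress well while staying searchable.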

Kevin


