[Biopython] Indexing large sequence files

Peter biopython at maubp.freeserve.co.uk
Fri Jun 19 11:12:17 UTC 2009


On Fri, Jun 19, 2009 at 10:49 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> OK, so it wasn't off list. Never mind - hopefully my email made
> sense; there were more typos than usual! I'm trying this now
> on a large FASTQ file...

OK, first of all I had problems using pickle protocol 2 with
SeqRecord objects, but protocols 0 and 1 seem to work fine.
I'm not quite sure what was going wrong there.
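
For reference, the sort of thing I mean is roughly this (a minimal
sketch - the toy record and its qualities are made up, not one of the
actual Solexa reads):

import pickle

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Toy record standing in for one short read
record = SeqRecord(Seq("ACGTACGT"), id="read_1", description="")
record.letter_annotations["phred_quality"] = [30] * 8

# Protocols 0 and 1 round-trip without complaint...
for protocol in (0, 1):
    copy = pickle.loads(pickle.dumps(record, protocol=protocol))
    assert str(copy.seq) == str(record.seq)

# ...while protocol 2 was the one giving me trouble:
# pickle.dumps(record, protocol=2)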

I got this to work on a one million read FASTQ file (short reads
from Solexa), but the time to build the shelve index and the
disc space it requires both seem prohibitive.
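
The shelve based indexing was essentially just this (a rough sketch;
"example.fastq" and the index filename are placeholders for the real
million read file):

import shelve

from Bio import SeqIO

# Build the index: read ID -> SeqRecord (shelve pickles the values,
# here using protocol 0 as discussed above)
index = shelve.open("example.fastq.idx", flag="n", protocol=0)
for record in SeqIO.parse("example.fastq", "fastq"):
    index[record.id] = record
index.close()

# Later, random access by read ID without re-parsing the whole file
index = shelve.open("example.fastq.idx", flag="r")
record = index["read_1"]
index.close()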

I also redid my old ad-hoc zlib-pickle index on disk, and while
the indexing time was similar, the index file it produces is much
more compact. The large shelve index file is a known issue - the
underlying file format is quite complicated because it allows the
index to be changed in situ, etc.
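
To give the flavour of the ad-hoc approach (a simplified sketch only,
not the actual code: zlib-compressed pickles appended to a single
data file, with an in-memory dict of offsets and lengths):

import pickle
import zlib

from Bio import SeqIO

# Build the index: write each record as a compressed pickle, noting
# where it starts and how long it is
lookup = {}
with open("example.fastq.zidx", "wb") as data:
    for record in SeqIO.parse("example.fastq", "fastq"):
        blob = zlib.compress(pickle.dumps(record, protocol=0))
        lookup[record.id] = (data.tell(), len(blob))
        data.write(blob)

# Random access: seek, read and decompress just the one record
def get_record(read_id, filename="example.fastq.zidx"):
    offset, length = lookup[read_id]
    with open(filename, "rb") as data:
        data.seek(offset)
        return pickle.loads(zlib.decompress(data.read(length)))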

Either way, an index file holding even compressed pickled versions
of the SeqRecord objects takes at least three times as much space
as the original FASTQ file.

So, for millions of records, I am going off the shelve/pickle
idea. Storing offsets into the original sequence file seems more
practical here - something along the lines of the sketch below.
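
i.e. something like the following (a rough sketch, assuming the simple
four-lines-per-record FASTQ layout the Solexa pipeline writes, with
"example.fastq" again as a placeholder filename):

from io import StringIO

from Bio import SeqIO

# Scan the file once, recording the byte offset where each read starts
offsets = {}
with open("example.fastq", "rb") as handle:
    while True:
        offset = handle.tell()
        title = handle.readline()
        if not title:
            break
        handle.readline()  # sequence line
        handle.readline()  # "+" line
        handle.readline()  # quality line
        read_id = title[1:].split(None, 1)[0].decode()
        offsets[read_id] = offset

# On demand, seek back and parse just that one record as a SeqRecord
def get_record(read_id, filename="example.fastq"):
    with open(filename, "rb") as handle:
        handle.seek(offsets[read_id])
        lines = b"".join(handle.readline() for _ in range(4))
    return SeqIO.read(StringIO(lines.decode()), "fastq")

The offsets dictionary only holds an integer per read, so it is far
smaller than storing pickled records, although for very large files
keeping even that on disk might still make sense.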

Peter


