[Bioperl-l] Bio::Index::Fastq - Interface for indexing (multiple) fastq files failure

Mon Apr 5 23:15:28 UTC 2010

On Mon, Apr 5, 2010 at 11:53 PM, Jason Stajich <jason at bioperl.org> wrote:
> Hi David - I am not sure this is going to be the right tool for the job.
>
> I'm concerned that none of the Bio::Index:: will really work for
> Illumina/NGS size data because once you get beyond about 4M hash
> keys things slow down quite dramatically and/or don't finish.
>
> I think we have to consider SQLite implementations or some more
> explicit way to handle larger keysize for hashes in the DB_File or
> BerkeleyDB approach. A similar slow problem can be seen if you
> just index a fastq converted fasta file from a single Illumina lane.

Another data point, this one in Python rather than Perl: SQLite
got a thumbs up over an in-house hash-based approach:

http://lists.idyll.org/pipermail/biology-in-python/2010-March/000511.html

I think a new SQLite-based Bio* OBF successor to the existing
BDB-based OBDA standard for indexing files could be very interesting.
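
To make the idea concrete, here is a rough sketch in Python (standard
library sqlite3 only) of what such an index might look like: one table
mapping each read ID to the byte offset and length of its record in the
FASTQ file. The table layout and the function names build_index and
fetch are my own assumptions for illustration, not an existing Bio* or
OBDA schema, and it assumes plain four-line FASTQ records (no wrapped
sequence lines).

```python
import sqlite3


def build_index(fastq_path, db_path):
    """Record the byte offset and length of every FASTQ record, keyed by read ID.

    Hypothetical schema: reads(id, start, length). Assumes four-line records.
    """
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS reads "
        "(id TEXT PRIMARY KEY, start INTEGER, length INTEGER)"
    )
    with open(fastq_path, "rb") as handle:

        def records():
            while True:
                start = handle.tell()
                lines = [handle.readline() for _ in range(4)]
                if not lines[0].strip():
                    break  # end of file (or a trailing blank line)
                # Read ID: text after '@' up to the first whitespace.
                read_id = lines[0][1:].split()[0].decode("ascii")
                yield read_id, start, sum(len(line) for line in lines)

        # One transaction for the whole load; committing per row would be slow.
        with con:
            con.executemany("INSERT INTO reads VALUES (?, ?, ?)", records())
    return con


def fetch(fastq_path, con, read_id):
    """Return the raw four-line record for read_id by seeking into the file."""
    row = con.execute(
        "SELECT start, length FROM reads WHERE id = ?", (read_id,)
    ).fetchone()
    if row is None:
        raise KeyError(read_id)
    with open(fastq_path, "rb") as handle:
        handle.seek(row[0])
        return handle.read(row[1]).decode("ascii")
```

Usage would be something like con = build_index("lane1.fastq",
"lane1.idx") followed by fetch("lane1.fastq", con, some_read_id), where
the file names are placeholders. I have not benchmarked this on a full
Illumina lane, so treat it as a starting point for discussion rather
than evidence that it scales.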

Peter