[Bioperl-l] Bio::Index::Fastq - Interface for indexing (multiple) fastq files failure

Mon Apr 5 23:57:02 UTC 2010

On Apr 5, 2010, at 6:15 PM, Peter wrote:

> On Mon, Apr 5, 2010 at 11:53 PM, Jason Stajich <jason at bioperl.org> wrote:
>> Hi David - I am not sure this is going to be the right tool for the job.
>> 
>> I'm concerned that none of the Bio::Index::* modules will really work for
>> Illumina/NGS size data because once you get beyond about 4M hash
>> keys things slow down quite dramatically and/or don't finish.
>> 
>> I think we have to consider SQLite implementations or some more
>> explicit way to handle larger keysize for hashes in the DB_File or
>> BerkeleyDB approach. A similar slowdown can be seen if you
>> just index a FASTQ-converted FASTA file from a single Illumina lane.
> 
> Another example, and this was in Python rather than Perl, but
> SQLite got a thumbs up over an in-house hash-based approach:
> 
> http://lists.idyll.org/pipermail/biology-in-python/2010-March/000511.html
> 
> I think a new SQLite-based Bio* OBF successor to the existing
> BDB-based OBDA standard for indexing files could be very interesting.
> 
> Peter

It would be nice to get some idea of the performance on a few data sets. SQLite is a very easy option (I'm using it routinely as well).
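
For anyone who wants something concrete to time, here is a minimal sketch of the SQLite idea, in Python like the example Peter linked. It is not the Bio::Index or OBDA API: the function names (build_index, fetch) and the "offsets" table are hypothetical, and it assumes strict 4-line FASTQ records with unique read IDs. It stores only the byte offset and length of each record and seeks into the original file on lookup.

```python
import sqlite3


def build_index(fastq_path, db_path, batch_size=100000):
    """Store (read id, byte offset, byte length) for each FASTQ record."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS offsets ("
        "id TEXT PRIMARY KEY, offset INTEGER, length INTEGER)"
    )
    insert = "INSERT OR REPLACE INTO offsets VALUES (?, ?, ?)"
    rows = []
    with open(fastq_path, "rb") as fh:
        while True:
            start = fh.tell()
            header = fh.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # skip stray blank lines
            # Assumption: exactly 4 lines per record (no wrapped
            # sequence or quality lines), so skip the next three.
            fh.readline()
            fh.readline()
            fh.readline()
            # Read id = first word after the leading '@'.
            read_id = header[1:].split()[0].decode("ascii")
            rows.append((read_id, start, fh.tell() - start))
            if len(rows) >= batch_size:
                con.executemany(insert, rows)
                rows.clear()
    con.executemany(insert, rows)
    con.commit()  # one transaction for the whole load
    con.close()


def fetch(fastq_path, db_path, read_id):
    """Return the raw text of one record, or None if the id is unknown."""
    con = sqlite3.connect(db_path)
    row = con.execute(
        "SELECT offset, length FROM offsets WHERE id = ?", (read_id,)
    ).fetchone()
    con.close()
    if row is None:
        return None
    with open(fastq_path, "rb") as fh:
        fh.seek(row[0])
        return fh.read(row[1]).decode("ascii")
```

Timing build_index and a batch of random fetch calls on a single Illumina lane, against the same file indexed with DB_File, would give the kind of comparison asked for here.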

chris