[Bioperl-l] Bio::Index::Fastq - Interface for indexing (multiple) fastq files failure
Chris Fields
cjfields at illinois.edu
Mon Apr 5 23:57:02 UTC 2010
On Apr 5, 2010, at 6:15 PM, Peter wrote:
> On Mon, Apr 5, 2010 at 11:53 PM, Jason Stajich <jason at bioperl.org> wrote:
>> Hi David - I am not sure this is going to be the right tool for the job.
>>
>> I'm concerned that none of the Bio::Index:: modules will really work for
>> Illumina/NGS-size data, because once you get beyond about 4M hash
>> keys things slow down quite dramatically and/or don't finish.
>>
>> I think we have to consider SQLite implementations or some more
>> explicit way to handle larger keysize for hashes in the DB_File or
>> BerkeleyDB approach. A similar slowdown can be seen if you
>> just index a FASTA file converted from FASTQ for a single Illumina lane.
>
> Here is another example, in Python rather than Perl, where
> SQLite got a thumbs-up over an in-house hash-based approach:
>
> http://lists.idyll.org/pipermail/biology-in-python/2010-March/000511.html
>
> I think a new SQLite-based Bio* OBF successor to the existing
> BDB-based OBDA standard for indexing files could be very interesting.
>
> Peter
It would be nice to get some performance numbers on a few real data sets. SQLite is a very easy option (I use it routinely as well).
chris
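
A minimal sketch of the SQLite idea discussed in this thread, written in Python (standard-library sqlite3) rather than Perl so that it runs without extra modules. It is not Bio::Index::Fastq or any existing BioPerl API. The function names build_index and fetch_record, the table name reads, and the column name byte_offset are all hypothetical. It also assumes a single file with strict four-line FASTQ records (no wrapped sequence or quality lines), and it says nothing about performance at Illumina scale, which would still need to be measured.

```python
import sqlite3


def build_index(fastq_path, db_path):
    """Store the byte offset of each FASTQ record, keyed by read ID.

    Assumption: every record is exactly four lines (header, sequence,
    '+' line, quality). Wrapped records would need a real parser.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reads ("
        "id TEXT PRIMARY KEY, byte_offset INTEGER NOT NULL)"
    )
    # 'with conn' wraps all inserts in one transaction and commits at the end.
    with open(fastq_path, "rb") as handle, conn:
        while True:
            offset = handle.tell()
            header = handle.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # tolerate blank lines between records
            if not header.startswith(b"@"):
                raise ValueError("expected '@' header at byte %d" % offset)
            # The read ID is the first whitespace-delimited token after '@'.
            read_id = header[1:].split()[0].decode("ascii")
            for _ in range(3):
                handle.readline()  # skip sequence, '+', and quality lines
            conn.execute(
                "INSERT INTO reads (id, byte_offset) VALUES (?, ?)",
                (read_id, offset),
            )
    return conn


def fetch_record(fastq_path, conn, read_id):
    """Return the four lines of one record, or None if the ID is not indexed."""
    row = conn.execute(
        "SELECT byte_offset FROM reads WHERE id = ?", (read_id,)
    ).fetchone()
    if row is None:
        return None
    with open(fastq_path, "rb") as handle:
        handle.seek(row[0])
        return [handle.readline().decode("ascii").rstrip("\r\n") for _ in range(4)]
```

Calling build_index("lane1.fastq", "lane1.idx.sqlite") once and then fetch_record("lane1.fastq", conn, "some_read_id") would seek directly to the stored offset instead of rescanning the file (both file names here are placeholders). Keeping only offsets in the database, rather than the sequences themselves, keeps the index small; whether lookups and index builds stay fast past a few million keys is exactly the open question raised above.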