[Bioperl-l] Bio::Index::Fastq - Interface for indexing (multiple) fastq files failure

Tue Apr 6 19:47:59 UTC 2010

No problem, it points to issues in the current implementation that need addressing.

Jason, you thinking we just need to replace BDB with SQLite, or you thinking something else?  

chris
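
For anyone following along, here is a rough sketch of what an SQLite-backed index might look like. It is written in Python for brevity and is not BioPerl code; `build_index`, `fetch`, and the `reads` table are made-up names for illustration only. It assumes strict four-line FASTQ records and read IDs that are unique across all indexed files. The idea is to store only the file name and byte offset of each record, so a lookup becomes one indexed query plus a seek.

```python
import sqlite3


def build_index(db_path, fastq_paths, batch_size=100000):
    """Record id -> (file, byte offset, record length) for every FASTQ record.

    Assumes 4-line records (no wrapped sequence/quality lines) and IDs that
    are unique across files; a duplicate ID raises sqlite3.IntegrityError.
    """
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS reads ("
        "id TEXT PRIMARY KEY, path TEXT NOT NULL, "
        "offset INTEGER NOT NULL, length INTEGER NOT NULL)"
    )
    insert = "INSERT INTO reads (id, path, offset, length) VALUES (?, ?, ?, ?)"
    for path in fastq_paths:
        rows = []
        with open(path, "rb") as handle:
            while True:
                offset = handle.tell()
                header = handle.readline()
                if not header:
                    break  # end of file
                if not header.strip():
                    continue  # tolerate blank lines between records
                handle.readline()  # sequence line
                plus = handle.readline()
                handle.readline()  # quality line
                fields = header[1:].split()
                if not header.startswith(b"@") or not plus.startswith(b"+") or not fields:
                    raise ValueError(f"malformed FASTQ record at byte {offset} in {path}")
                read_id = fields[0].decode("ascii")
                rows.append((read_id, path, offset, handle.tell() - offset))
                if len(rows) >= batch_size:
                    con.executemany(insert, rows)  # insert in batches
                    rows = []
        if rows:
            con.executemany(insert, rows)
    con.commit()
    return con


def fetch(con, read_id):
    """Return the raw FASTQ record text for read_id, or raise KeyError."""
    row = con.execute(
        "SELECT path, offset, length FROM reads WHERE id = ?", (read_id,)
    ).fetchone()
    if row is None:
        raise KeyError(read_id)
    path, offset, length = row
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.read(length).decode("ascii")
```

Usage would be along the lines of `con = build_index("reads.sqlite", ["lane1.fastq"])` followed by `fetch(con, "some_read_id")`. Whether this actually scales better than the current BDB back-end on a full Illumina lane is exactly what would need testing.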

On Apr 6, 2010, at 2:38 PM, KOVALIC, DAVID K [AG/1000] wrote:

> Guys,
> 
> Thanks for the information; it is good to know what the problem is.
> 
> I am afraid I am not much of a programmer so I am not liable to be much
> help with any work switching out the back-end. I can however volunteer
> for testing purposes if this helps at all.
> 
> I think this is just a case of NGS data volumes having overtaken a
> previously adequate implementation.
> 
> David
> 
> 
> -----Original Message-----
> From: Chris Fields [mailto:cjfields at illinois.edu] 
> Sent: Monday, April 05, 2010 6:57 PM
> To: Peter
> Cc: Jason Stajich; KOVALIC, DAVID K [AG/1000]; bioperl-l at bioperl.org
> Subject: Re: [Bioperl-l] Bio::Index::Fastq - Interface for indexing
> (multiple) fastq files failure
> 
> On Apr 5, 2010, at 6:15 PM, Peter wrote:
> 
>> On Mon, Apr 5, 2010 at 11:53 PM, Jason Stajich <jason at bioperl.org> wrote:
>>> Hi David - I am not sure this is going to be the right tool for the job.
>>> 
>>> I'm concerned that none of the Bio::Index::* modules will really work for
>>> Illumina/NGS size data because once you get beyond about 4M hash
>>> keys things slow down quite dramatically and/or don't finish.
>>> 
>>> I think we have to consider SQLite implementations or some more
>>> explicit way to handle larger keysize for hashes in the DB_File or
>>> BerkeleyDB approach. A similar slow problem can be seen if you
>>> just index a fastq converted fasta file from a single Illumina lane.
>> 
>> Another example, and this was in Python rather than Perl, but
>> SQLite got a thumbs up over an in house hash based approach:
>> 
>> 
>> http://lists.idyll.org/pipermail/biology-in-python/2010-March/000511.html
>> 
>> I think a new SQLite based Bio* OBF successor to the existing
>> BDB based OBDA standard for indexing files could be very interesting.
>> 
>> Peter
> 
> Would be nice to get some ideas performance-wise with some data sets.
> SQLite is a very easy option (I'm using it routinely as well).
> 
> chris
> 