[Bioperl-l] Bio::Index::Fastq - Interface for indexing (multiple) fastq files failure

Jason Stajich jason at bioperl.org
Tue Apr 6 20:28:18 UTC 2010


I think SQLite is a good solution, but I still found things a bit
slow when I was storing all the data in the db. If we are instead
just indexing byte offsets in the file (which is what the current
indexing is doing), maybe it will perform well enough.
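
To make the byte-offset idea concrete, here is a rough sketch in Python
(not BioPerl code) using the standard sqlite3 module. It stores only the
read id, the byte offset, and the record length, then seeks into the
original file on lookup. The table name, column names, and function names
are made up for illustration, and it assumes plain four-line FASTQ records
with no wrapped sequence or quality lines.

```python
import sqlite3


def _iter_records(fh):
    """Yield (read_id, byte_offset, byte_length) for each 4-line record."""
    while True:
        offset = fh.tell()
        header = fh.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # skip stray blank lines
        # Assumption: sequence, '+' line, and quality are one line each.
        rest = fh.readline() + fh.readline() + fh.readline()
        read_id = header[1:].split()[0].decode("ascii")
        yield read_id, offset, len(header) + len(rest)


def build_index(fastq_path, db_path=":memory:"):
    """Index a FASTQ file by byte offset; returns the open connection."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fastq_index ("
        "read_id TEXT PRIMARY KEY, "
        "byte_offset INTEGER NOT NULL, "
        "byte_length INTEGER NOT NULL)"
    )
    with open(fastq_path, "rb") as fh, conn:
        # One transaction for all rows; executemany consumes the generator.
        conn.executemany(
            "INSERT OR REPLACE INTO fastq_index VALUES (?, ?, ?)",
            _iter_records(fh),
        )
    return conn


def fetch_record(conn, fastq_path, read_id):
    """Return the raw record text for read_id, or None if it is not indexed."""
    row = conn.execute(
        "SELECT byte_offset, byte_length FROM fastq_index WHERE read_id = ?",
        (read_id,),
    ).fetchone()
    if row is None:
        return None
    offset, length = row
    with open(fastq_path, "rb") as fh:
        fh.seek(offset)
        return fh.read(length).decode("ascii")
```

Doing all the inserts in a single transaction is the main design choice
here; whether that is fast enough at NGS scale is exactly what would need
measuring.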

One question on implementing this: do we want plug-in implementations
for the Bio::Index:: classes (and Bio::DB::Fasta as well, I would think)
that abstract the indexing method, or just a new implementation such as
Bio::Index::FastqSQLite? Or we could simply replace BDB/DB_File with
SQLite and take on a new required dependency.
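
For the plug-in option, the shape would be roughly the following, sketched
in Python rather than Perl just to show the structure. Every class and
method name here is hypothetical and none of it is existing Bio::Index
API; the dict-backed store only stands in for a hash-style backend like
DB_File.

```python
import sqlite3
from abc import ABC, abstractmethod


class OffsetStore(ABC):
    """Hypothetical backend interface: maps a record id to a byte offset."""

    @abstractmethod
    def put(self, record_id, byte_offset):
        ...

    @abstractmethod
    def get(self, record_id):
        ...


class DictStore(OffsetStore):
    """In-memory stand-in for a hash-based backend."""

    def __init__(self):
        self._data = {}

    def put(self, record_id, byte_offset):
        self._data[record_id] = byte_offset

    def get(self, record_id):
        return self._data.get(record_id)


class SQLiteStore(OffsetStore):
    """SQLite-backed implementation of the same interface."""

    def __init__(self, db_path=":memory:"):
        self._conn = sqlite3.connect(db_path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS idx ("
            "record_id TEXT PRIMARY KEY, byte_offset INTEGER NOT NULL)"
        )

    def put(self, record_id, byte_offset):
        with self._conn:  # commit each write
            self._conn.execute(
                "INSERT OR REPLACE INTO idx VALUES (?, ?)",
                (record_id, byte_offset),
            )

    def get(self, record_id):
        row = self._conn.execute(
            "SELECT byte_offset FROM idx WHERE record_id = ?", (record_id,)
        ).fetchone()
        return row[0] if row else None


class Indexer:
    """Index front end that does not care which store it is given."""

    def __init__(self, store):
        self.store = store

    def add(self, record_id, byte_offset):
        self.store.put(record_id, byte_offset)

    def offset_of(self, record_id):
        return self.store.get(record_id)
```

The front end stays the same whichever store is passed in, so SQLite
would become an optional dependency rather than a required one.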

I'd also want to look at the solutions employed in some of the short
read aligners, to see whether they index the fastq files in some other way.

-jason
Chris Fields wrote, On 4/6/10 12:47 PM:
> No problem, it points to an issue in the current implementation that needs addressing.
>
> Jason, you thinking we just need to replace BDB with SQLite, or you thinking something else?
>
> chris
>
> On Apr 6, 2010, at 2:38 PM, KOVALIC, DAVID K [AG/1000] wrote:
>
>    
>> Guys,
>>
>> Thanks for information; it is good to know what the problem is.
>>
>> I am afraid I am not much of a programmer so I am not liable to be much
>> help with any work switching out the back-end. I can however volunteer
>> for testing purposes if this helps at all.
>>
>> I think this is just a case of NGS data volumes having overtaken a
>> previously adequate implementation.
>>
>> David
>>
>>
>> -----Original Message-----
>> From: Chris Fields [mailto:cjfields at illinois.edu]
>> Sent: Monday, April 05, 2010 6:57 PM
>> To: Peter
>> Cc: Jason Stajich; KOVALIC, DAVID K [AG/1000]; bioperl-l at bioperl.org
>> Subject: Re: [Bioperl-l] Bio::Index::Fastq - Interface for indexing
>> (multiple) fastq files failure
>>
>> On Apr 5, 2010, at 6:15 PM, Peter wrote:
>>
>>      
>>> On Mon, Apr 5, 2010 at 11:53 PM, Jason Stajich <jason at bioperl.org> wrote:
>>>> Hi David - I am not sure this is going to be the right tool for the job.
>>>> I'm concerned that none of the Bio::Index:: will really work for
>>>> Illumina/NGS size data because once you get beyond about 4M hash
>>>> keys things slow down quite dramatically and/or don't finish.
>>>>
>>>> I think we have to consider SQLite implementations or some more
>>>> explicit way to handle larger keysize for hashes in the DB_File or
>>>> BerkeleyDB approach. A similar slow problem can be seen if you
>>>> just index a fastq converted fasta file from a single Illumina lane.
>>>>          
>>> Another example, and this was in Python rather than Perl, but
>>> SQLite got a thumbs up over an in house hash based approach:
>>>
>>>
>>>        
>>> http://lists.idyll.org/pipermail/biology-in-python/2010-March/000511.html
>>      
>>> I think a new SQLite based Bio* OBF successor to the existing
>>> BDB based OBDA standard for indexing files could be very interesting.
>>>
>>> Peter
>>>        
>> Would be nice to get some ideas performance-wise with some data sets.
>> SQLite is a very easy option (I'm using it routinely as well).
>>
>> chris
>>
>>
>>      
>
>    


