[Bioperl-l] Bio::Index::Fastq '@' in qual

Tue Nov 1 17:44:25 UTC 2011

On Nov 1, 2011, at 11:02 AM, Jason Stajich wrote:

> I think a different indexer is needed for the scale of key/value pairs we see in fastq files if we want to make a fast lookup by ID. I think speed is of the essence for this type of solution, so requiring that all records be exactly 4 lines long is okay for this type of implementation.

This can always be an early optimization; that's easy enough. But I'm sure we will have to deal with multi-line seq/qual FASTQ at some point.
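
To make the multi-line case concrete, here is a rough sketch in Python (purely illustrative; it is not BioPerl or Biopython code, and the name index_fastq is made up for this example). It assumes a binary file handle and records the byte offset and length of each record by ID. Quality lines are consumed until their total length matches the sequence length, so a quality line that begins with '@' is not mistaken for the next header.

```python
import io


def index_fastq(handle):
    """Return {record ID: (byte offset, byte length)} for a binary FASTQ handle.

    Hypothetical helper for illustration only. Multi-line sequence and
    quality blocks are supported by counting symbols rather than lines.
    """
    index = {}
    while True:
        start = handle.tell()
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        if not header.startswith(b"@"):
            raise ValueError("expected '@' header at byte %d" % start)
        fields = header[1:].split(None, 1)
        if not fields:
            raise ValueError("missing record ID at byte %d" % start)
        record_id = fields[0].decode("ascii")

        # Sequence lines run until the '+' separator line.
        seq_len = 0
        while True:
            line = handle.readline()
            if not line:
                raise ValueError("truncated record %r" % record_id)
            if line.startswith(b"+"):
                break
            seq_len += len(line.strip())

        # Quality lines run until we have as many symbols as bases,
        # regardless of whether a line happens to start with '@'.
        qual_len = 0
        while qual_len < seq_len:
            line = handle.readline()
            if not line:
                raise ValueError("truncated quality for %r" % record_id)
            qual_len += len(line.strip())
        if qual_len != seq_len:
            raise ValueError("seq/qual length mismatch for %r" % record_id)

        index[record_id] = (start, handle.tell() - start)
    return index


if __name__ == "__main__":
    # Two records; the first has two-line seq/qual and a qual line starting with '@'.
    demo = b"@r1 desc\nACGT\nACGT\n+\n@III\nIII!\n@r2\nAC\n+r2\n@@\n"
    print(index_fastq(io.BytesIO(demo)))
```

A strict four-lines-per-record fast path could sit in front of this as the early optimization, falling back to the symbol-counting loop only when needed.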

> I found NoSQL implementations to perform much better than any of the BDB-type solutions, which end up being really slow above 1-5M keys. I used TokyoCabinet and KyotoCabinet to index accession -> taxonomy ID and found them quite fast for my needs. I haven't tried storing 100bp reads + qual string as the value yet, but I think it could be done; it's certainly worth a prototype.

Adding a middle layer where the backend storage is abstracted is probably the (best|most flexible) option, converging on a good default that will work for this data. The actual interface is in place, though would it be more feasible to go the OBDA route (converge on a cross-Bio* compatible schema)? Or are there problems afoot there that we're unaware of?
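
As a rough idea of what such a middle layer might look like, here is a minimal Python sketch. Everything in it is an assumption for illustration: the IndexStore interface, the two backends, the table layout, and fetch_record are hypothetical and do not reflect the OBDA schema or any existing Bio* indexer. The index maps a record ID to a byte offset and length, and the caller never needs to know which backend is in use.

```python
import sqlite3
from abc import ABC, abstractmethod


class IndexStore(ABC):
    """Hypothetical backend interface: record ID -> (byte start, byte length)."""

    @abstractmethod
    def put(self, key, start, length):
        """Store the location of one record."""

    @abstractmethod
    def get(self, key):
        """Return (start, length) or raise KeyError."""


class MemoryStore(IndexStore):
    """In-memory backend backed by a plain dict."""

    def __init__(self):
        self._data = {}

    def put(self, key, start, length):
        self._data[key] = (start, length)

    def get(self, key):
        return self._data[key]


class SQLiteStore(IndexStore):
    """SQLite backend; the table layout here is invented for this sketch."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS record_index ("
            "rec_id TEXT PRIMARY KEY, byte_start INTEGER, byte_len INTEGER)"
        )

    def put(self, key, start, length):
        self._db.execute(
            "INSERT OR REPLACE INTO record_index VALUES (?, ?, ?)",
            (key, start, length),
        )

    def get(self, key):
        row = self._db.execute(
            "SELECT byte_start, byte_len FROM record_index WHERE rec_id = ?",
            (key,),
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return row

    def commit(self):
        # Persist pending inserts when using an on-disk database.
        self._db.commit()


def fetch_record(handle, store, key):
    """Read the raw bytes of one record from a binary handle via any backend."""
    start, length = store.get(key)
    handle.seek(start)
    return handle.read(length)
```

Swapping in a TokyoCabinet/KyotoCabinet or BDB backend would then only mean adding another IndexStore subclass, which is where a shared schema would matter.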

Re: specifics, I think Biopython uses SQLite; is that correct, Peter?

chris

> Jason
> On Nov 1, 2011, at 7:38 AM, Peter Cock wrote:
> 
>> On Tue, Nov 1, 2011 at 1:40 PM, Fields, Christopher J
>> <cjfields at illinois.edu> wrote:
>>> 
>>> One problem the various Bio* indexers have currently is the lack of
>>> standardization on a specific schema for indexing.  There are in-roads
>>> towards this (OBDA) that haven't been adequately traveled IMHO,
>>> which need to be taken up again.
>>> 
>> 
>> Something to switch to open-bio-l at lists.open-bio.org for,
>> http://lists.open-bio.org/mailman/listinfo/open-bio-l
>> 
>> We can continue this thread from last summer,
>> http://lists.open-bio.org/pipermail/open-bio-l/2010-April/000662.html
>> http://lists.open-bio.org/pipermail/open-bio-l/2010-June/000676.html
>> ...
>> http://lists.open-bio.org/pipermail/open-bio-l/2010-June/000680.html
>> 
>> And CC Peter Rice from EMBOSS too - we chatted about this
>> at ISMB/BOSC 2011 in July - and whoever looks after the
>> OBDA/indexing code in BioRuby and BioJava too.
>> 
>>> A second, and maybe this is more specific to BioPerl, is that the
>>> parsers and indexers essentially reimplement the format parsing
>>> in each module, so if there are bugs they have to be independently
>>> fixed (hence why SeqIO works and the indexer doesn't; I wrote the
>>> first but not the second).  The best place for any optimizations
>>> would be in a unified parser that both the SeqIO and indexer
>>> modules could use.
>> 
>> We have that problem to an extent in Biopython's Bio.SeqIO code.
>> The indexing code duplicates some logic of the parsing code
>> (how much depends on the format), sufficient to extract the read
>> ID and the bounds on disk. The two could be more unified but
>> the parsers came first and I didn't want to change them at the time.
>> Instead I tried to be rigorous in consistency testing for the index
>> code's unit tests.
>> 
>> Regards,
>> 
>> Peter
>> 