[Bioperl-l] Bio::Index::Fastq '@' in qual

Tue Nov 1 12:02:24 EDT 2011

I think the scale of key/value pairs we see in FASTQ files calls for a different indexer if we want fast lookup by ID. Speed is of the essence for this kind of solution, so requiring every record to be exactly 4 lines long is an acceptable restriction for this type of implementation.
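
As a rough illustration of the strict 4-line idea, here is a minimal Python sketch (not BioPerl or Biopython code; the names index_fastq and fetch are made up for this example). It assumes unwrapped records, so a quality line that starts with '@' is never mistaken for a header, and it keeps the ID -> byte offset map in a plain in-memory dict rather than a real key/value store.

```python
import io


def index_fastq(handle):
    """Map read ID -> byte offset of its '@' header line.

    Assumes every record is exactly four lines (header, sequence, '+',
    quality). Lines are consumed in groups of four, so the content of the
    quality line is never inspected for a leading '@'.
    """
    index = {}
    while True:
        offset = handle.tell()
        header = handle.readline()
        if not header:
            break  # clean end of file
        handle.readline()  # sequence line, not needed for the index
        plus = handle.readline()
        qual = handle.readline()
        id_parts = header[1:].split(None, 1)
        if (not header.startswith(b"@") or not plus.startswith(b"+")
                or not qual or not id_parts):
            raise ValueError("not a strict 4-line FASTQ record at byte %d" % offset)
        index[id_parts[0].decode("ascii")] = offset
    return index


def fetch(handle, index, read_id):
    """Return the four lines of one record, without line endings."""
    handle.seek(index[read_id])
    return [handle.readline().rstrip(b"\r\n").decode("ascii") for _ in range(4)]


# Hypothetical two-read file; the first quality string begins with '@'.
demo = io.BytesIO(b"@r1 desc\nACGT\n+\n@III\n@r2\nTTGA\n+\nIIII\n")
demo_index = index_fastq(demo)
demo_record = fetch(demo, demo_index, "r2")
```

The file has to be opened in binary mode so that tell() and seek() work on plain byte offsets.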

I found NoSQL implementations to perform much better than any of the BDB-type solutions, which end up being really slow above 1-5M keys. I used TokyoCabinet and KyotoCabinet to index accession -> taxonomy ID and found them quite fast for my needs. I haven't yet tried storing 100bp reads + qual strings as the values, but I think it could be done; it is certainly worth a prototype.
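
A prototype of the "read + qual as the value" idea might look roughly like the sketch below. To be clear about the assumptions: it uses Python's standard dbm module purely as a stand-in for the key/value interface, not TokyoCabinet or KyotoCabinet, so it says nothing about their speed; build_record_store and lookup are invented names; and records are again assumed to be exactly 4 lines.

```python
import dbm
import os
import tempfile


def build_record_store(fastq_path, db_path):
    """Store 'sequence\\nquality' under each read ID in an on-disk key/value file."""
    with open(fastq_path, "rb") as fq, dbm.open(db_path, "n") as db:
        while True:
            header, seq, plus, qual = (fq.readline() for _ in range(4))
            if not header:
                break  # clean end of file
            id_parts = header[1:].split(None, 1)
            if not header.startswith(b"@") or not qual or not id_parts:
                raise ValueError("not a strict 4-line FASTQ record")
            # Key is the read ID; value is sequence and quality joined by a newline.
            db[id_parts[0]] = seq.rstrip(b"\r\n") + b"\n" + qual.rstrip(b"\r\n")


def lookup(db_path, read_id):
    """Return (sequence, quality) for one read ID; raises KeyError if absent."""
    with dbm.open(db_path, "r") as db:
        seq, qual = db[read_id.encode("ascii")].split(b"\n")
    return seq.decode("ascii"), qual.decode("ascii")


def demo_lookup():
    """Write a hypothetical one-read FASTQ file, index it, and look the read up."""
    with tempfile.TemporaryDirectory() as tmp:
        fastq_path = os.path.join(tmp, "reads.fastq")
        with open(fastq_path, "wb") as out:
            out.write(b"@r1 desc\nACGT\n+\n@III\n")
        db_path = os.path.join(tmp, "reads.db")
        build_record_store(fastq_path, db_path)
        return lookup(db_path, "r1")
```

Storing the record itself trades disk space for skipping the seek into the original FASTQ file; storing only a byte offset as the value would keep the store much smaller.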

Jason
On Nov 1, 2011, at 7:38 AM, Peter Cock wrote:

> On Tue, Nov 1, 2011 at 1:40 PM, Fields, Christopher J
> <cjfields at illinois.edu> wrote:
>> 
>> One problem the various Bio* indexers have currently is the lack of
>> standardization on a specific schema for indexing.  There are in-roads
>> towards this (OBDA) that haven't been adequately traveled IMHO,
>> which need to be taken up again.
>> 
> 
> Something to switch to open-bio-l at lists.open-bio.org for,
> http://lists.open-bio.org/mailman/listinfo/open-bio-l
> 
> We can continue this thread from last summer,
> http://lists.open-bio.org/pipermail/open-bio-l/2010-April/000662.html
> http://lists.open-bio.org/pipermail/open-bio-l/2010-June/000676.html
> ...
> http://lists.open-bio.org/pipermail/open-bio-l/2010-June/000680.html
> 
> And CC Peter Rice from EMBOSS too - we chatted about this
> at ISMB/BOSC 2011 in July - and whoever looks after the
> OBDA/indexing code in BioRuby and BioJava too.
> 
>> A second, and maybe this is more specific to BioPerl, is that the
>> parsers and indexers essentially reimplement the format parsing
>> in each module, so if there are bugs they have to be independently
>> fixed (hence why SeqIO works and the indexer doesn't; I wrote the
>> first but not the second).  The best place for any optimizations
>> would be in a unified parser that both the SeqIO and indexer
>> modules could use.
> 
> We have that problem to an extent in Biopython's Bio.SeqIO code.
> The indexing code duplicates some logic of the parsing code
> (how much depends on the format), sufficient to extract the read
> ID and the bounds on disk. The two could be more unified but
> the parsers came first and I didn't want to change them at the time.
> Instead I tried to be rigorous in consistency testing for the index
> code's unit tests.
> 
> Regards,
> 
> Peter