[Bioperl-l] Bio::Index::Fastq '@' in qual
Fields, Christopher J
cjfields at illinois.edu
Tue Nov 1 13:44:25 EDT 2011
On Nov 1, 2011, at 11:02 AM, Jason Stajich wrote:
> I think a different indexer is needed for the scale of key/value pairs we see in FASTQ files if we want fast lookup by ID. Speed is of the essence for this type of solution, so requiring that all records be exactly 4 lines long is okay for this type of implementation.
This can always be an early optimization; that's easy enough. But I'm sure we will have to deal with multi-line seq/qual FASTQ at some point.
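To make the 4-line shortcut concrete, here is a minimal sketch in Python (not BioPerl code, and not how Bio::Index::Fastq works today). It assumes every record is exactly four lines, so a quality line that happens to start with '@' is never mistaken for a header. The function names and the sample data are made up for illustration.

```python
import io


def index_fastq_4line(handle):
    """Map read ID -> (byte offset, record length) for strict 4-line FASTQ.

    Assumption: no multi-line sequence or quality strings. Records are
    found by counting lines, not by scanning for '@', so an '@' at the
    start of a quality line is harmless.
    """
    index = {}
    while True:
        offset = handle.tell()
        header, seq, plus, qual = (handle.readline() for _ in range(4))
        if not header:
            break  # clean end of file
        fields = header[1:].split()
        if not header.startswith(b"@") or not fields:
            raise ValueError("bad header at byte %d" % offset)
        if not plus.startswith(b"+") or not qual:
            raise ValueError("record at byte %d is not 4-line FASTQ" % offset)
        length = len(header) + len(seq) + len(plus) + len(qual)
        index[fields[0].decode("ascii")] = (offset, length)
    return index


def fetch(handle, index, read_id):
    """Return the raw bytes of one record using the stored offset and length."""
    offset, length = index[read_id]
    handle.seek(offset)
    return handle.read(length)


# Hypothetical sample: the quality line of r1 starts with '@'.
SAMPLE = b"@r1 desc\nACGT\n+\n@III\n@r2\nGGCC\n+\nIIII\n"

handle = io.BytesIO(SAMPLE)
index = index_fastq_4line(handle)
print(fetch(handle, index, "r1").decode("ascii"))
```

Multi-line seq/qual FASTQ would break the line counting, which is why this is only the early optimization and not the general parser.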
> I found NoSQL implementations to perform much better than any of the BDB-type solutions, which end up being really slow above 1-5M keys. I used TokyoCabinet and KyotoCabinet to index accession -> taxonomy ID and found them quite fast for my needs. I haven't tried storing 100bp reads + qual string as the value yet, but I think it could be done; it's certainly worth a prototype.
Adding a middle layer where the backend storage is abstracted is probably the (best|most flexible) option, while converging on a good default that will work for this data. The actual interface is in place, though would it be more feasible to go the OBDA route (converge on a cross-Bio* compatible schema)? Or are there problems afoot there that we're unaware of?
Re: specifics, I think Biopython uses SQLite; is that correct, Peter?
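As a rough illustration of the middle-layer idea, the sketch below (Python, using only the standard library) puts two interchangeable stores behind one small interface: an in-memory dict and SQLite. All class names, the table name, and the column names are hypothetical. This is not the OBDA schema, and it is not the schema Biopython or BioPerl actually uses.

```python
import sqlite3


class DictBackend:
    """In-memory store; fine for small files, lost when the process exits."""

    def __init__(self):
        self._data = {}

    def put(self, read_id, start, length):
        self._data[read_id] = (start, length)

    def get(self, read_id):
        return self._data[read_id]  # raises KeyError if missing


class SQLiteBackend:
    """SQLite store; the table layout here is an assumption for illustration."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS read_index "
            "(read_id TEXT PRIMARY KEY, start INTEGER, length INTEGER)"
        )

    def put(self, read_id, start, length):
        self._db.execute(
            "INSERT OR REPLACE INTO read_index VALUES (?, ?, ?)",
            (read_id, start, length),
        )

    def get(self, read_id):
        row = self._db.execute(
            "SELECT start, length FROM read_index WHERE read_id = ?",
            (read_id,),
        ).fetchone()
        if row is None:
            raise KeyError(read_id)
        return row


class ReadIndex:
    """The middle layer: callers see add/locate, never the storage engine."""

    def __init__(self, backend):
        self._backend = backend

    def add(self, read_id, start, length):
        self._backend.put(read_id, start, length)

    def locate(self, read_id):
        return self._backend.get(read_id)


# The same calling code works with either backend.
for backend in (DictBackend(), SQLiteBackend()):
    idx = ReadIndex(backend)
    idx.add("r1", 0, 21)
    print(type(backend).__name__, idx.locate("r1"))
```

A Tokyo Cabinet or Kyoto Cabinet store would slot in the same way, as another class with `put` and `get`, which is what would let us settle on a default later without changing the interface.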
> On Nov 1, 2011, at 7:38 AM, Peter Cock wrote:
>> On Tue, Nov 1, 2011 at 1:40 PM, Fields, Christopher J
>> <cjfields at illinois.edu> wrote:
>>> One problem the various Bio* indexers have currently is the lack of
>>> standardization on a specific schema for indexing. There are inroads
>>> towards this (OBDA) that haven't been adequately traveled IMHO,
>>> which need to be taken up again.
>> Something to switch to open-bio-l at lists.open-bio.org for,
>> We can continue this thread from last summer,
>> And CC Peter Rice from EMBOSS too - we chatted about this
>> at ISMB/BOSC 2011 in July - and whoever looks after the
>> OBDA/indexing code in BioRuby and BioJava too.
>>> A second, and maybe this is more specific to BioPerl, is that the
>>> parsers and indexers essentially reimplement the format parsing
>>> in each module, so if there are bugs they have to be independently
>>> fixed (hence why SeqIO works and the indexer doesn't; I wrote the
>>> first but not the second). The best place for any optimizations
>>> would be in a unified parser that both the SeqIO and indexer
>>> modules could use.
>> We have that problem to an extent in Biopython's Bio.SeqIO code.
>> The indexing code duplicates some logic of the parsing code
>> (how much depends on the format), sufficient to extract the read
>> ID and the bounds on disk. The two could be more unified but
>> the parsers came first and I didn't want to change them at the time.
>> Instead I tried to be rigorous in consistency testing for the index
>> code's unit tests.