[Bioperl-l] Bio::Index::Fastq '@' in qual

Tue Nov 1 14:06:50 EDT 2011

On Tue, Nov 1, 2011 at 5:44 PM, Fields, Christopher J
<cjfields at illinois.edu> wrote:
> On Nov 1, 2011, at 11:02 AM, Jason Stajich wrote:
>
>> I think a different indexer is needed for the scale of key/value
>> pairs we see in fastq files if we want to make a fast lookup by
>> ID. I think speed is of the essence for this type of solution,
>> so forcing all records to be 4 lines long is okay for this
>> type of implementation.
>
> This can always be an early optimization, that's easy enough.
> But I'm sure we will have to deal with multi-line seq/qual
> FASTQ at some point.
>
>> I found NOSQL implementations to perform much better
>> than any of the BDB type solutions -- they end up being
>> really slow above 1-5M keys.  I used
>> TokyoCabinet and KyotoCabinet to do indexing of accession
>> -> taxonomy ID and found it quite fast for the needs. I
>> haven't tried storing 100bp reads + qual string as the
>> value in it yet but I think it could be done, certainly worth
>> a prototype.
>
> Adding a middle layer where the backend storage is abstracted
> is probably the (best|most flexible) option, converging on a
> good default that will work for this data.  The actual interface is
> in place, though would it be more feasible to go the OBDA
> (converge on a cross-Bio* compatible schema)?  Or are there
> problems afoot there we're unaware of?
>
> Re: specifics, I think Biopython uses SQLite, is that correct Peter?
>
> chris

Yes, we're using SQLite3 to store essentially a list of filenames
and their format as one table, and then in the main table an
entry for each sequence recording the ID (only one accession,
unlike OBDA which had infrastructure for a secondary accession),
file number, offset of the start of the record, and optionally the
length of the record on disk.

I.e. basically what OBDA does, but using SQLite rather
than BDB (not included in Python 3) or a flat file index
(poor performance with large datasets).
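
As a rough sketch of that layout (this is not Biopython's actual
code; the table and column names, build_index() and fetch_raw()
are made up for illustration, and it assumes strict 4-line FASTQ
records so that an '@' starting a quality line is never taken for
a header):

```python
import sqlite3


def build_index(db_path, filenames, fmt="fastq"):
    """Index strict 4-line FASTQ files by record ID (illustrative schema)."""
    con = sqlite3.connect(db_path)
    # One table listing the indexed files and their format.
    con.execute(
        "CREATE TABLE files ("
        "file_number INTEGER PRIMARY KEY, name TEXT, format TEXT)"
    )
    # Main table: one row per record, pointing back into the original file.
    con.execute(
        "CREATE TABLE records ("
        "record_id TEXT PRIMARY KEY, file_number INTEGER, "
        "start_offset INTEGER, record_length INTEGER)"
    )
    for file_number, name in enumerate(filenames):
        con.execute(
            "INSERT INTO files VALUES (?, ?, ?)", (file_number, name, fmt)
        )
        with open(name, "rb") as handle:
            while True:
                start = handle.tell()
                lines = [handle.readline() for _ in range(4)]
                if not lines[0].strip():
                    break  # end of file (or a trailing blank line)
                # Header is "@ID optional description"; keep only the ID.
                record_id = lines[0][1:].split()[0].decode("ascii")
                length = sum(len(line) for line in lines)
                con.execute(
                    "INSERT INTO records VALUES (?, ?, ?, ?)",
                    (record_id, file_number, start, length),
                )
    con.commit()
    return con


def fetch_raw(con, record_id):
    """Return the raw bytes of one record, read from the untouched file."""
    row = con.execute(
        "SELECT name, start_offset, record_length "
        "FROM records JOIN files USING (file_number) "
        "WHERE record_id = ?",
        (record_id,),
    ).fetchone()
    if row is None:
        raise KeyError(record_id)
    name, start, length = row
    with open(name, "rb") as handle:
        handle.seek(start)
        return handle.read(length)
```

The bytes returned by fetch_raw() would then be handed to whatever
parser already exists for that file format, so the index itself
never needs to understand the record contents.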

I find this design attractive on several levels:
* File format neutral, covers FASTA, FASTQ, GenBank, etc
* Preserves the original file untouched
* Index is a small single file (thanks to SQLite)
* Back end could be switched out
* Could be applied to compressed file formats
* Reuses existing parsing code to access entries

This could easily form the basis of OBDA v2. The main points
of difference I anticipate between the Bio* projects would
be naming conventions for the different file formats, and
what we consider to be the default record ID of each read
(e.g. which field in a GenBank file - although agreement
here is not essential). Some of that was already settled in
principle with OBDA v1.

On the other hand, you could try and store the parsed data
itself, which is where NOSQL looks more interesting. That
essentially requires the ability to serialise your annotated
sequence object model to disk - which would be tricky to do
cross project (much more ambitious than BioSQL is). It also
means the "index" becomes very large because it now holds
all the original data.
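
To make the contrast concrete, here is a minimal sketch of the
"store the parsed data" approach. It uses Python's standard dbm
module only as a stand-in for a key/value store like TokyoCabinet,
and JSON as a stand-in serialisation; parse_fastq(), store_parsed()
and load_parsed() are hypothetical names, and strict 4-line FASTQ
is assumed again:

```python
import dbm
import json


def parse_fastq(path):
    """Yield (read_id, seq, qual) from a strict 4-line FASTQ file."""
    with open(path) as handle:
        while True:
            header = handle.readline()
            if not header.strip():
                break  # end of file (or a trailing blank line)
            seq = handle.readline().rstrip("\n")
            handle.readline()  # skip the '+' separator line
            qual = handle.readline().rstrip("\n")
            yield header[1:].split()[0], seq, qual


def store_parsed(store_path, fastq_path):
    """Write every parsed read into the key/value store; return the count."""
    count = 0
    with dbm.open(store_path, "c") as db:
        for read_id, seq, qual in parse_fastq(fastq_path):
            # The value holds the full sequence and quality string, so the
            # store grows with the data rather than with the number of IDs.
            value = json.dumps({"seq": seq, "qual": qual})
            db[read_id.encode("utf-8")] = value.encode("utf-8")
            count += 1
    return count


def load_parsed(store_path, read_id):
    """Fetch one read by ID without touching the original FASTQ file."""
    with dbm.open(store_path, "r") as db:
        return json.loads(db[read_id.encode("utf-8")])
```

A plain seq/qual pair serialises trivially; the hard part for a
cross-project format would be the richer annotated objects
(features, references, and so on), which this sketch does not
attempt.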

Peter