[Bioperl-l] OBDA redux? was Re: Bio::Index::Fastq '@' in qual

Thu Nov 3 14:28:36 EDT 2011

(side thread, so re-titling...)

On Nov 1, 2011, at 1:06 PM, Peter Cock wrote:

> On Tue, Nov 1, 2011 at 5:44 PM, Fields, Christopher J
> <cjfields at illinois.edu> wrote:
>> On Nov 1, 2011, at 11:02 AM, Jason Stajich wrote:
>> 
>>> I think a different indexer is needed for the scale of key/value
>>> pairs we see in fastq files if we want to make a fast lookup by
>>> ID. I think speed is of the essence for this type of solution, so
>>> requiring that all records be exactly 4 lines long is okay for
>>> this type of implementation.
>> 
>> This can always be an early optimization, that's easy enough.
>> But I'm sure we will have to deal with multi-line seq/qual
>> FASTQ at some point.
>> 
>>> I found NOSQL implementations to perform much better
>>> than any of the BDB type solutions -- they
>>> end up being really slow above 1-5M keys.  I used
>>> TokyoCabinet and KyotoCabinet to do indexing of accession
>>> -> taxonomy ID and found it quite fast for the needs. I
>>> haven't tried storing 100bp reads + qual string as the
>>> value in it yet but I think it could be done, certainly worth
>>> a prototype.
>> 
>> Adding a middle layer where the backend storage is abstracted
>> is probably the (best|most flexible) option, converging on a
>> good default that will work for this data.  The actual interface is
>> in place, though would it be more feasible to go the OBDA route
>> (converge on a cross-Bio* compatible schema)?  Or are there
>> problems afoot there we're unaware of?
>> 
>> Re: specifics, I think Biopython uses SQLite, is that correct Peter?
>> 
>> chris
> 
> Yes, we're using SQLite3 to store essentially a list of filenames
> and their format as one table, and then in the main table an
> entry for each sequence recording the ID (only one accession,
> unlike OBDA which had infrastructure for a secondary accession),
> file number, offset of the start of the record, and optionally the
> length of the record on disk.
> 
> i.e. Basically what OBDA does, but using SQLite rather
> than BDB (not included in Python 3) or a flat file index
> (poor performance with large datasets).
> 
> I find this design attractive on several levels:
> * File format neutral, covers FASTA, FASTQ, GenBank, etc
> * Preserves the original file untouched
> * Index is a small single file (thanks to SQLite)
> * Back end could be switched out
> * Could be applied to compressed file formats
> * Reuses existing parsing code to access entries
> 
> This could easily form the basis of OBDA v2. The main points
> of difference I anticipate between the Bio* projects would
> be naming conventions for the different file formats, and
> what we consider to be the default record ID of each read
> (e.g. which field in a GenBank file - although agreement
> here is not essential). Some of that was already settled in
> principle with OBDA v1.

The primary/secondary IDs could be configurable with a sane default. I think the BioPerl implementations allowed this, and it is certainly something that will be requested.
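
To make the offset-index design Peter describes concrete, here is a rough Python sketch using only the standard library's sqlite3 module. It is not Biopython's actual code: the table names (`files`, `records`), column names, and the functions `build_index`, `get_raw`, and `default_key` are all made up for illustration. It assumes strict 4-line FASTQ records, as Jason suggested, and takes the record ID from a pluggable `key_func` so the default could be overridden.

```python
import sqlite3


def default_key(header_line):
    """Hypothetical default ID: first word after the leading '@'."""
    return header_line[1:].split(None, 1)[0]


def build_index(db_path, filenames, fmt="fastq", key_func=default_key):
    """Index strict 4-line FASTQ records as ID -> (file, offset, length).

    Table and column names are illustrative assumptions only.
    """
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE files (file_number INTEGER PRIMARY KEY, "
        "name TEXT, format TEXT)"
    )
    con.execute(
        "CREATE TABLE records (key TEXT PRIMARY KEY, "
        "file_number INTEGER, offset INTEGER, length INTEGER)"
    )
    for number, name in enumerate(filenames):
        con.execute("INSERT INTO files VALUES (?, ?, ?)", (number, name, fmt))
        with open(name, "rb") as handle:
            while True:
                start = handle.tell()
                # Reading exactly four lines per record means a quality
                # line that starts with '@' is never mistaken for a header.
                lines = [handle.readline() for _ in range(4)]
                if not lines[0]:
                    break  # end of file
                key = key_func(lines[0].decode("ascii"))
                con.execute(
                    "INSERT INTO records VALUES (?, ?, ?, ?)",
                    (key, number, start, handle.tell() - start),
                )
    con.commit()
    return con


def get_raw(con, key):
    """Return the raw bytes of one record; the source file is untouched."""
    row = con.execute(
        "SELECT f.name, r.offset, r.length FROM records r "
        "JOIN files f ON f.file_number = r.file_number WHERE r.key = ?",
        (key,),
    ).fetchone()
    if row is None:
        raise KeyError(key)
    name, offset, length = row
    with open(name, "rb") as handle:
        handle.seek(offset)
        return handle.read(length)
```

Something like `con = build_index("reads.idx", ["reads.fastq"])` followed by `get_raw(con, "some_read_id")` would hand back the raw record, which can then go through whatever parser the project already has. Swapping the back end would mean replacing only the two SQL-touching functions.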

> On the other hand, you could try and store the parsed data
> itself, which is where NOSQL looks more interesting. That
> essentially requires the ability to serialise your annotated
> sequence object model to disk - which would be tricky to do
> cross project (much more ambitious than BioSQL is). It also
> means the "index" becomes very large because it now holds
> all the original data.
> 
> Peter

For a fully cross-Bio* compliant format, I don't think it's feasible to use serialized data unless it is serialized in something that is easily deserialized across HLLs (JSON, BSON, YAML, XML, etc.).  Either that, or such data is stored concurrently with the binary blob, along with metadata that indicates the source of the blob, parser, version, etc. (unless there are tools out there that reliably interconvert serialized complex data structures between HLLs).  Any way you go about it, it seems like it could be a major ball of hurt unless implemented very carefully.
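
As a rough illustration of the language-neutral option, the following Python sketch wraps a record in a JSON envelope that carries the metadata mentioned above (source, parser, version). Every field name here, the `schema` number, and the functions `pack_record` and `unpack_record` are assumptions made up for the example, not an agreed cross-Bio* schema.

```python
import json


def pack_record(seq_id, seq, qual, source_file, parser, parser_version):
    """Serialize one read plus provenance metadata as JSON text.

    All field names are hypothetical; a real schema would need to be
    agreed on across the Bio* projects.
    """
    envelope = {
        "meta": {
            "schema": 1,  # hypothetical schema version number
            "source": source_file,
            "parser": parser,
            "parser_version": parser_version,
        },
        "record": {"id": seq_id, "seq": seq, "qual": qual},
    }
    return json.dumps(envelope, sort_keys=True)


def unpack_record(text):
    """Deserialize a record, refusing schema versions we do not know."""
    envelope = json.loads(text)
    if envelope["meta"]["schema"] != 1:
        raise ValueError("unsupported schema version")
    return envelope["record"]
```

Because the value is plain JSON, any of the Bio* languages could read it back without knowing which toolkit wrote it; the cost is that richly annotated sequence objects would need an agreed mapping onto such a structure.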

Aside: I think this was one of the problems with Bio::DB::SeqFeature::Store, in that it at one point stored Perl-specific Storable blobs.

chris