[Open-bio-l] OBDA redux?

Thu Nov 3 14:52:50 EDT 2011

On Thu, Nov 3, 2011 at 6:28 PM, Fields, Christopher J
<cjfields at illinois.edu> wrote:
> (side thread, so re-titling...)
>

And CC'ing open-bio-l, which is a better home for this than bioperl-l,
where OBDA v2 talk came up again in discussion of a BioPerl indexing
problem. Archive links for thread here:

http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035807.html
http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035808.html
http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035811.html
http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035812.html
http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035813.html
http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035822.html

> On Nov 1, 2011, at 1:06 PM, Peter Cock wrote:
>>
>> Yes, we're using SQLite3 to store essentially a list of filenames
>> and their format as one table, and then in the main table an
>> entry for each sequence recording the ID (only one accession,
>> unlike OBDA which had infrastructure for a secondary accession),
>> file number, offset of the start of the record, and optionally the
>> length of the record on disk.
>>
>> i.e. Basically what OBDA does, but using SQLite rather
>> than BDB (not included in Python 3) or a flat file index
>> (poor performance with large datasets).
>>
>> I find this design attractive on several levels:
>> * File format neutral, covers FASTA, FASTQ, GenBank, etc
>> * Preserves the original file untouched
>> * Index is a small single file (thanks to SQLite)
>> * Back end could be switched out
>> * Could be applied to compressed file formats
>> * Reuses existing parsing code to access entries
>>
>> This could easily form the basis of OBDA v2. The main points
>> of difference I anticipate between the Bio* projects would
>> be naming conventions for the different file formats, and
>> what we consider to be the default record ID of each read
>> (e.g. which field in a GenBank file - although agreement
>> here is not essential). Some of that was already settled in
>> principle with OBDA v1.
>
> The primary/secondary IDs could be configurable with a sane
> default; I think the BioPerl implementations allowed this (and
> it is certainly something that will be requested).

One reason I went with a single ID only was to keep the
Python dictionary based API simple (think hash in Perl).
You don't get secondary keys in a Python dict or a hash ;)

As a nod to flexibility, in Biopython's Bio.SeqIO indexing you
can provide a callback function to map the suggested ID to
something else. Obviously this doesn't give the full flexibility
of extracting a field from the record's annotation because we
don't parse the whole record during indexing (it would be too
slow).
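
To make the callback idea concrete, here is a minimal standard-library
sketch (not Biopython's actual code). It scans a FASTA file, looks only
at each title line, and records the byte offset of every record under a
key that an optional callback may rewrite. The names index_fasta and
accession_only are hypothetical, and the "gi|...|ref|...|" ID layout is
only an assumed example.

```python
def index_fasta(path, key_function=None):
    """Map each record key to the byte offset of its '>' title line.

    Only the title line is examined, so the callback sees the suggested
    ID (first word of the title) and nothing from the record body.
    """
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break  # end of file
            if line.startswith(b">"):
                words = line[1:].split(None, 1)
                suggested = words[0].decode("ascii") if words else ""
                key = key_function(suggested) if key_function else suggested
                if key in offsets:
                    raise ValueError("Duplicate key %r" % key)
                offsets[key] = offset
    return offsets


def accession_only(suggested_id):
    """Hypothetical callback: 'gi|1|ref|A.1|' becomes 'A.1'."""
    return suggested_id.split("|")[3]
```

Calling index_fasta("example.fasta", key_function=accession_only) would
key the index by accession rather than by the full title-line ID. In
Biopython itself the equivalent hook is the key_function argument to
Bio.SeqIO.index.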

However, I'm happy for there to be an *optional* secondary
key in an OBDA v2 SQLite schema, but Biopython probably
won't populate it. We could optionally use it rather than the
primary ID on loading an existing index though.

Personally I would stick with one key in the index - it should
be faster and makes it simpler to switch the back end if we
need to later. If anyone wants a second key, they can build
a second index *grin*.
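
For discussion, a rough sketch of what such a two-table layout with an
optional secondary key might look like, using Python's sqlite3 module.
The table and column names here are purely illustrative assumptions;
this is neither Biopython's actual schema nor a proposed OBDA v2 spec.

```python
import sqlite3

# Illustrative schema: one table of files, one table of record offsets.
# secondary_id and record_length are nullable, i.e. optional.
SCHEMA = """
CREATE TABLE files (
    file_number INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    format TEXT NOT NULL
);
CREATE TABLE records (
    primary_id TEXT PRIMARY KEY,
    secondary_id TEXT,
    file_number INTEGER NOT NULL REFERENCES files(file_number),
    start_offset INTEGER NOT NULL,
    record_length INTEGER
);
"""


def create_index(path=":memory:"):
    """Create an empty index database and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def lookup(conn, record_id, use_secondary=False):
    """Return (filename, format, offset, length) for a key, or None."""
    column = "secondary_id" if use_secondary else "primary_id"
    query = (
        "SELECT f.name, f.format, r.start_offset, r.record_length "
        "FROM records AS r JOIN files AS f USING (file_number) "
        "WHERE r.%s = ?" % column
    )
    return conn.execute(query, (record_id,)).fetchone()
```

A tool that ignores secondary keys simply leaves that column NULL when
building the index, while a tool loading an existing index can choose
which column to look records up by.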

>> On the other hand, you could try and store the parsed data
>> itself, which is where NoSQL looks more interesting. That
>> essentially requires the ability to serialise your annotated
>> sequence object model to disk - which would be tricky to do
>> cross project (much more ambitious than BioSQL is). It also
>> means the "index" becomes very large because it now holds
>> all the original data.
>>
>> Peter
>
> For a fully cross-Bio* compliant format, I don't think it's feasible
> to use serialized data unless they are serialized in something
> that is easily deserialized across HLLs (JSON, BSON, YAML,
> XML, etc).  Either that, or such data is stored concurrently with
> the binary blob, along with meta data that indicates the source
> of the blob, parser, version, etc, etc (unless there are tools out
> there that reliably interconvert serialized complex data structures
> between HLLs).  Any way you go about it, it seems like it could
> be a major ball of hurt, unless implemented very carefully.

You missed out RDF as a serialisation ;)

But yes, going down the shared serialisation route is going
to be messy - as you are well aware:

> Aside: I think this was one of the problems with
> Bio::DB::SeqFeature::Store, in that it at one point stored
> Perl-specific Storable blobs.
>
> chris

Peter