[Open-bio-l] OBDA redux?
Fields, Christopher J
cjfields at illinois.edu
Thu Nov 3 15:47:51 EDT 2011
On Nov 3, 2011, at 1:52 PM, Peter Cock wrote:
> On Thu, Nov 3, 2011 at 6:28 PM, Fields, Christopher J
> <cjfields at illinois.edu> wrote:
>> (side thread, so re-titling...)
> And CC'ing open-bio-l, which is a better home for this than bioperl-l,
> where OBDA v2 talk came up again in discussion of a BioPerl indexing
> problem. Archive links for thread here:
yes, good idea...
>> On Nov 1, 2011, at 1:06 PM, Peter Cock wrote:
>>> Yes, we're using SQLite3 to store essentially a list of filenames
>>> and their format as one table, and then in the main table an
>>> entry for each sequence recording the ID (only one accession,
>>> unlike OBDA which had infrastructure for a secondary accession),
>>> file number, offset of the start of the record, and optionally the
>>> length of the record on disk.
>>> i.e. Basically what OBDA does, but using SQLite rather
>>> than BDB (not included in Python 3) or a flat file index
>>> (poor performance with large datasets).
>>> I find this design attractive on several levels:
>>> * File format neutral, covers FASTA, FASTQ, GenBank, etc
>>> * Preserves the original file untouched
>>> * Index is a small single file (thanks to SQLite)
>>> * Back end could be switched out
>>> * Could be applied to compressed file formats
>>> * Reuses existing parsing code to access entries
>>> This could easily form basis of OBDA v2, the main points
>>> of difference I anticipate between the Bio* projects would
>>> be naming conventions for the different file formats, and
>>> what we consider to be the default record ID of each read
>>> (e.g. which field in a GenBank file - although agreement
>>> here is not essential). Some of that was already settled in
>>> principle with OBDA v1.
>> The primary/secondary IDs could be configurable with a sane
>> default, I think the bioperl implementations allowed this (and
>> it is certainly something that will be requested).
> One reason I went with a single ID only was to keep the
> Python dictionary based API simple (think hash in Perl).
> You don't get secondary keys in a Python dict or a hash ;)
> As a nod to flexibility, in Biopython's Bio.SeqIO indexing you
> can provide a call back function to map the suggested ID to
> something else. Obviously this doesn't give the full flexibility
> of extracting a field from the record's annotation because we
> don't parse the whole record during indexing (it would be too
Same with bioperl.
> However, I'm happy for there to be an *optional* secondary
> key in an OBDA v2 SQLite schema, but Biopython probably
> won't populate it. We could optionally use it rather than the
> primary ID on loading an existing index though.
Optional implementation of that is fine by me.
> Personally I would stick with one key in the index - it should
> be faster and makes it simpler to switch the back end if we
> need to later. If anyone wants a second key, they can build
> a second index *grin*.
That's easy enough.
>>> On the other hand, you could try and store the parsed data
>>> itself, which is where NOSQL looks more interesting. That
>>> essentially requires the ability to serialise your annotated
>>> sequence object model to disk - which would be tricky to do
>>> cross project (much more ambitious than BioSQL is). It also
>>> means the "index" becomes very large because it now holds
>>> all the original data.
>> For a fully cross-Bio* compliant format, I don't think it's feasible
>> to use serialized data unless they are serialized in something
>> that is easily deserialized across HLLs (JSON, BSON, YAML,
>> XML, etc). Either that, or such data is stored concurrently with
>> the binary blob, along with meta data that indicates the source
>> of the blob, parser, version, etc, etc (unless there are tools out
>> there that reliably interconvert serialized complex data structures
>> between HLLs). Anyway you go about it, it seems like it could
>> be a major ball of hurt, unless implemented very carefully.
> You missed out RDF as a serialisation ;)
> But yes, going down the shared serialisation route is going
> to be messy - as you are well aware:
>> Aside: I think this was one of the problems with
>> Bio::DB::SeqFeature::Store, in that it at one point stored
>> Perl-specific Storable blobs.
yes, it's a problem w/o an easy solution. Anyway, I think an implementation of such at this point would be a premature optimization.
More information about the Open-Bio-l