[Open-bio-l] OBDA redux?
p.j.a.cock at googlemail.com
Sun Nov 13 07:24:35 EST 2011
On Thu, Nov 3, 2011 at 7:47 PM, Fields, Christopher J
<cjfields at illinois.edu> wrote:
> On Nov 3, 2011, at 1:52 PM, Peter Cock wrote:
>> On Thu, Nov 3, 2011 at 6:28 PM, Fields, Christopher J
>> <cjfields at illinois.edu> wrote:
>>> (side thread, so re-titling...)
>> And CC'ing open-bio-l, which is a better home for this than bioperl-l,
>> where OBDA v2 talk came up again in discussion of a BioPerl indexing
>> problem. Archive links for thread here:
> yes, good idea...
I've stopped CC'ing bioperl-l.
>>> On Nov 1, 2011, at 1:06 PM, Peter Cock wrote:
>>>> Yes, we're using SQLite3 to store essentially a list of filenames
>>>> and their format as one table, and then in the main table an
>>>> entry for each sequence recording the ID (only one accession,
>>>> unlike OBDA which had infrastructure for a secondary accession),
>>>> file number, offset of the start of the record, and optionally the
>>>> length of the record on disk.
>>>> i.e. Basically what OBDA does, but using SQLite rather
>>>> than BDB (not included in Python 3) or a flat file index
>>>> (poor performance with large datasets).
>>>> I find this design attractive on several levels:
>>>> * File format neutral, covers FASTA, FASTQ, GenBank, etc
>>>> * Preserves the original file untouched
>>>> * Index is a small single file (thanks to SQLite)
>>>> * Back end could be switched out
>>>> * Could be applied to compressed file formats
>>>> * Reuses existing parsing code to access entries
>>>> This could easily form the basis of OBDA v2; the main points
>>>> of difference I anticipate between the Bio* projects would
>>>> be naming conventions for the different file formats, and
>>>> what we consider to be the default record ID of each read
>>>> (e.g. which field in a GenBank file - although agreement
>>>> here is not essential). Some of that was already settled in
>>>> principle with OBDA v1.
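
A minimal sketch of the two-table layout described above, using Python's standard sqlite3 module. The table and column names (file_data, offset_data, record_id and so on) and the helper functions are assumptions made for illustration; they are not taken from the OBDA specification or from Biopython's actual schema. The indexer only handles FASTA and takes the first word of the header line as the ID.

```python
import sqlite3


def create_index(db_path=":memory:"):
    """Create the two tables: one row per file, one row per record."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE file_data ("
        "file_number INTEGER PRIMARY KEY, file_name TEXT, file_format TEXT)"
    )
    con.execute(
        "CREATE TABLE offset_data ("
        "record_id TEXT PRIMARY KEY, file_number INTEGER, "
        "record_offset INTEGER, record_length INTEGER)"
    )
    return con


def index_fasta(con, filename, file_number):
    """Scan a FASTA file once, recording the offset and length of each record."""
    con.execute(
        "INSERT INTO file_data VALUES (?, ?, ?)", (file_number, filename, "fasta")
    )
    rows = []
    key, start, pos = None, 0, 0
    # Binary mode so that offsets are byte offsets, whatever the line endings.
    with open(filename, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                if key is not None:
                    rows.append((key, file_number, start, pos - start))
                fields = line[1:].split(None, 1)
                key = fields[0].decode("ascii") if fields else ""
                start = pos
            pos += len(line)
    if key is not None:
        rows.append((key, file_number, start, pos - start))
    con.executemany("INSERT INTO offset_data VALUES (?, ?, ?, ?)", rows)
    con.commit()


def get_raw(con, record_id):
    """Return the raw bytes of one record; the original file is left untouched."""
    row = con.execute(
        "SELECT f.file_name, o.record_offset, o.record_length "
        "FROM offset_data AS o JOIN file_data AS f "
        "ON f.file_number = o.file_number WHERE o.record_id = ?",
        (record_id,),
    ).fetchone()
    if row is None:
        raise KeyError(record_id)
    file_name, offset, length = row
    with open(file_name, "rb") as handle:
        handle.seek(offset)
        return handle.read(length)
```

The raw bytes returned by get_raw would then be handed to whichever existing parser matches the format recorded in file_data, which is the "reuses existing parsing code" point in the list above.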
>>> The primary/secondary IDs could be configurable with a sane
>>> default, I think the bioperl implementations allowed this (and
>>> it is certainly something that will be requested).
>> One reason I went with a single ID only was to keep the
>> Python dictionary based API simple (think hash in Perl).
>> You don't get secondary keys in a Python dict or a hash ;)
>> As a nod to flexibility, in Biopython's Bio.SeqIO indexing you
>> can provide a callback function to map the suggested ID to
>> something else. Obviously this doesn't give the full flexibility
>> of extracting a field from the record's annotation because we
>> don't parse the whole record during indexing (it would be too
> Same with bioperl.
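
To illustrate the callback idea: in Biopython this is the key_function argument to Bio.SeqIO.index. The sketch below does not use Biopython; it is a self-contained toy FASTA indexer, and the names index_offsets and ncbi_accession are hypothetical. It shows the callback seeing only the suggested ID from the header line, never the parsed record.

```python
import io


def index_offsets(handle, key_function=None):
    """Map each record key to the byte offset of its '>' header line.

    key_function, if given, receives the suggested ID (first word of the
    header) and returns the key to store instead.
    """
    index = {}
    pos = 0
    for line in handle:
        if line.startswith(b">"):
            fields = line[1:].split(None, 1)
            suggested = fields[0].decode("ascii") if fields else ""
            key = key_function(suggested) if key_function else suggested
            if key in index:
                raise ValueError("Duplicate key %r" % key)
            index[key] = pos
        pos += len(line)
    return index


def ncbi_accession(identifier):
    """Example callback: pull the accession out of a 'gi|...|ref|...|' style ID."""
    parts = identifier.split("|")
    return parts[3] if len(parts) >= 4 else identifier


# Usage: the same data indexed with and without the callback.
data = b">gi|12345|ref|NM_000001.1| example\nACGT\n>plain_id\nGG\n"
default_index = index_offsets(io.BytesIO(data))
mapped_index = index_offsets(io.BytesIO(data), key_function=ncbi_accession)
```

With only one key per record, a mapping that yields the same key twice has to be treated as an error; here that is a ValueError.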
>> However, I'm happy for there to be an *optional* secondary
>> key in an OBDA v2 SQLite schema, but Biopython probably
>> won't populate it. We could optionally use it rather than the
>> primary ID on loading an existing index though.
> Optional implementation of that is fine by me.
>> Personally I would stick with one key in the index - it should
>> be faster and makes it simpler to switch the back end if we
>> need to later. If anyone wants a second key, they can build
>> a second index *grin*.
> That's easy enough.
>>>> On the other hand, you could try and store the parsed data
>>>> itself, which is where NoSQL looks more interesting. That
>>>> essentially requires the ability to serialise your annotated
>>>> sequence object model to disk - which would be tricky to do
>>>> cross project (much more ambitious than BioSQL is). It also
>>>> means the "index" becomes very large because it now holds
>>>> all the original data.
>>> For a fully cross-Bio* compliant format, I don't think it's feasible
>>> to use serialized data unless they are serialized in something
>>> that is easily deserialized across HLLs (JSON, BSON, YAML,
>>> XML, etc). Either that, or such data is stored concurrently with
>>> the binary blob, along with meta data that indicates the source
>>> of the blob, parser, version, etc, etc (unless there are tools out
>>> there that reliably interconvert serialized complex data structures
>>> between HLLs). Any way you go about it, it seems like it could
>>> be a major ball of hurt, unless implemented very carefully.
>> You missed out RDF as a serialisation ;)
>> But yes, going down the shared serialisation route is going
>> to be messy - as you are well aware:
>>> Aside: I think this was one of the problems with
>>> Bio::DB::SeqFeature::Store, in that it at one point stored
>>> Perl-specific Storable blobs.
> yes, it's a problem w/o an easy solution. Anyway, I think an
> implementation of such at this point would be a premature
So, Chris and I seem in general agreement that an OBDA v2
using SQLite but based on essentially the same approach as
the BDB or flat file based OBDA v1 is a good idea, i.e. tables
mapping record identifiers to file offsets in the original sequence
I hope to get BioRuby on board; they already have OBDA v1
support, so that shouldn't be too hard.
Right now I don't recall if BioJava has/had OBDA v1 support,
and, if they did, whether it was affected by their recent move to
BioJava v3 (I understand from their mailing list that some older,
lower priority functionality has not all been ported yet).
Also EMBOSS are likely to be interested, certainly Peter Rice
was interested in the SQLite indexing we're already using in
Biopython for sequence files (i.e. what is effectively the
prototype for OBDA v2).
Note that in addition to simple indexing of text files, we are
already using the same simple offset + length approach for
indexing binary files (e.g. SFF).
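
A small sketch of why offset + length carries over to binary files unchanged. The record layout here is a made-up toy format (two big-endian 16-bit lengths, then a name, then a payload), not SFF, and the function names are hypothetical. Once the index holds (offset, length) pairs, retrieval is the same seek and read as for text.

```python
import io
import struct

HEADER = struct.Struct(">HH")  # toy header: name length, payload length


def index_binary(handle):
    """Scan toy binary records, returning {name: (offset, length)}."""
    index = {}
    while True:
        start = handle.tell()
        header = handle.read(HEADER.size)
        if not header:
            break  # clean end of file
        if len(header) < HEADER.size:
            raise ValueError("Truncated record header at offset %d" % start)
        name_len, payload_len = HEADER.unpack(header)
        name = handle.read(name_len).decode("ascii")
        handle.seek(payload_len, io.SEEK_CUR)  # skip payload, no parsing needed
        index[name] = (start, HEADER.size + name_len + payload_len)
    return index


def get_raw(handle, index, key):
    """Fetch one record's raw bytes using only its offset and length."""
    offset, length = index[key]
    handle.seek(offset)
    return handle.read(length)


def make_record(name, payload):
    """Helper for the example: serialise one toy record."""
    encoded = name.encode("ascii")
    return HEADER.pack(len(encoded), len(payload)) + encoded + payload
```

Storing the length as well as the offset means a record can be pulled out with a single read, without scanning forward to find where the next record starts.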
On the immediate practical side, I think I can edit the
current OBDA website at http://obda.open-bio.org/
via /home/websites/obda.open-bio.org/html on the
We need to work out where the current OBDA indexing
specification lives (CVS or SVN?) and perhaps move
that to GitHub. We may need a general OBF organisation
account on GitHub for this and any other cross-project
I see there is already an OBDA project on Redmine
(Chris, can you add me to that please?)