[Open-bio-l] OBDA redux?
Toshiaki Katayama
k at bioruby.org
Thu Nov 17 00:00:50 UTC 2011
Hi Jason,
I was not actively following this thread but have one comment:
> I don't know if there is a generic API for the NOSQL systems which would
> help for standardization.
To my knowledge, RDF/SPARQL is the only standardized format/protocol
among the NoSQL stores. Unfortunately, its performance and scalability
are not yet comparable to those of the widely used key-value stores
(e.g. Tokyo Cabinet). However, the Semantic Web has the potential to
become a standard for storing heterogeneous data sets as an integrated
biological DB without designing any schema (we need ontologies instead).
Cheers,
Toshiaki Katayama
On 2011/11/17, at 5:19, Jason Stajich wrote:
> Not to overly advocate for NOSQL, as I think for our purposes the jury
> is still out. So I think it is worth benchmarking - NOSQL and SQL-based
> systems will have different overheads.
>
> I know when I have tried to store 100M -> 500M records in SQLite the
> performance degrades whereas I was able to store that range of keys in
> NOSQL db without problem.
>
> I don't know if there is a generic API for the NOSQL systems which would
> help for standardization.
>
> Jason Stajich
> jason at bioperl.org
>
>
> On Mon, Nov 14, 2011 at 1:47 PM, Fields, Christopher J <
> cjfields at illinois.edu> wrote:
>
>> On Nov 14, 2011, at 12:14 PM, Peter Cock wrote:
>>
>>> Hi Chris,
>>>
>>> [Did you mean to CC BioPerl-l? Should I have?]
>>>
>>> On Mon, Nov 14, 2011 at 5:59 PM, Fields, Christopher J
>>> <cjfields at illinois.edu> wrote:
>>>> On Nov 13, 2011, at 6:24 AM, Peter Cock wrote:
>>>>
>>>>> So, Chris and I seem in general agreement that an OBDA v2
>>>>> using SQLite but based on essentially the same approach as
>>>>> the BDB or flat file based OBDA v1 is a good idea. i.e. Tables
>>>>> mapping record identifiers to file offsets in the original sequence
>>>>> files.
>>>>
>>>> The worry I have is adhering to a specific backend (e.g. SQLite).
>>>> The reason I say this is b/c BDB in its time was the go-to way
>>>> of storing simple index data, but that is no longer feasible for
>>>> very large data sets. Who's to say something similar won't
>>>> happen to SQLite, or that it is the best option available?
>>>
>>> Right now I would think SQLite is one of the best (if not the
>>> best) option. If supporting the old back ends is important for
>>> cross-project compatibility, I'm willing to have another go
>>> at using BDB in Biopython, but had limited success last
>>> time I tried.
>>
>> No, I agree re: SQLite at the moment, it's probably the best option (fast,
>> widely adopted, etc), though Jason mentioned (Tokyo|Kyoto)Cabinet also
>> worked very well. I would rather not paint ourselves into a corner if the
>> 'nice-and-shiny' next thing down the road performs better and gains wide
>> adoption.
>>
>>>> Maybe we should focus on the data storage schema, as
>>>> simple as it may be, then indicate the default backend
>>>> must be SQLite but others are allowed (maybe with a
>>>> mention that SQLite can be replaced by alternatives in
>>>> the future if needed).
>>>
>>> It would make sense to talk about an SQL schema if
>>> the "other options" would also be SQL based. But they
>>> might not be... but certainly we should keep potential
>>> alternative back ends in mind.
>>
>> It's probably necessary to allow for both possibilities (SQL and other).
>> For instance, a move to SQLite will necessitate describing the table data
>> with SQL anyway.
>>
>>>>> I hope to get BioRuby on board; they already have OBDA
>>>>> v1 support so that shouldn't be too hard.
>>>>>
>>>>> Right now I don't recall if BioJava has/had OBDA v1 support,
>>>>> and if they did if it was affected in their recent move to BioJava
>>>>> v3 (I understand from their mailing list that some older lower
>>>>> priority functionality has not all been ported yet).
>>>>
>>>> I wouldn't be surprised at that, OBDA kind of lingered for a
>>>> while, and I'm not sure how widely adopted it became
>>>> (maybe others can shed light on that?)
>>>
>>> Well, OBDA went beyond just indexing flat files - it also
>>> tried to standardize things like remote database access.
>>> I don't think we ever really had that side working in
>>> Biopython, so I am less familiar with it. I know EMBOSS
>>> has something fairly extensive for online databases,
>>> but have not checked if it uses the OBDA style or their
>>> own.
>>
>> Right, but I wonder if that may have been one problem with the original
>> OBDA specification, that it was perhaps overly ambitious out of the gate.
>>
>>> For now I was only planning to tackle indexing sequence
>>> files in this "OBDA redux".
>>
>> That's a good and simpler start; the rest (remote access) can fall into
>> place naturally once that is done.
>>
>>>>> Also EMBOSS are likely to be interested, certainly Peter Rice
>>>>> was interested in the SQLite indexing we're already using in
>>>>> Biopython for sequence files (i.e. what is effectively the
>>>>> prototype for OBDA v2).
>>>>>
>>>>> Note that in addition to simple indexing of text files, we are
>>>>> already using the same simple offset + length approach for
>>>>> indexing binary files (e.g. SFF).
>>>>
>>>> I think that's the general idea, that is how all bioperl data
>>>> was indexed, before with the Bio::Index modules and with
>>>> the OBDA implementations as well.
>>>
>>> Good.
>>>
>>>>> On the immediate practical side, I think I can edit the
>>>>> current OBDA website of http://obda.open-bio.org/
>>>>> via /home/websites/obda.open-bio.org/html on the
>>>>> server.
>>>>
>>>> See below w/ regards to my thoughts on the wiki.
>>>>
>>>>> We need to work out where the current OBDA indexing
>>>>> specification lives (CVS or SVN?) and perhaps move
>>>>> that to github. We may need a general OBF organisation
>>>>> account on GitHub for this and any other cross-project
>>>>> repositories.
>>>>
>>>> +1 to a move to github, but maybe this belongs in an
>>>> OBF-specific organization.
>>>
>>> Yes, definitely under an OBF github account (not under
>>> Biopython, BioPerl, etc).
>>>
>>>> And maybe we should take advantage of the simple
>>>> wiki or project homepage that GitHub offers and move
>>>> everything (docs) there.
>>>
>>> That could work. We'd have to go through all the old
>>> documentation and relocate it, then we could make the
>>> obda.open-bio.org domain point at the github pages.
>>
>> Yes, I think that's the idea.
>>
>>>>> I see there is already an OBDA project on RedMine,
>>>>> (Chris can you add me to that please?)
>>>>> https://redmine.open-bio.org/projects/obda
>>>>>
>>>>> Peter
>>>>
>>>> Done (last night actually, but I didn't have time to respond
>>>> immediately).
>>>>
>>>> chris
>>>
>>> Thanks,
>>>
>>> Peter
>>
>> np.
>>
>> -c
>>
>>
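For reference, the core idea discussed in this thread (a table mapping
record identifiers to byte offsets and lengths in the original sequence
file) might be sketched as follows in Python with the standard sqlite3
module. This is only an illustration under stated assumptions, not the
OBDA v2 schema or Biopython's actual index format: the table name
offset_index, its column names, and the helpers build_index and
fetch_raw are all hypothetical, and the input is assumed to be a FASTA
file in which every record starts with a '>' line carrying an identifier.

```python
import sqlite3


def build_index(fasta_path, db_path=":memory:"):
    """Index a FASTA file as: record id -> (byte offset, byte length).

    Hypothetical schema for illustration only; not the OBDA specification.
    """
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS offset_index ("
        " key TEXT PRIMARY KEY,"
        " offset INTEGER NOT NULL,"
        " length INTEGER NOT NULL)"
    )
    rows = []
    key, start, pos = None, 0, 0
    # Binary mode, so that positions are true byte offsets usable with seek().
    with open(fasta_path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                if key is not None:
                    rows.append((key, start, pos - start))
                parts = line[1:].split(None, 1)
                if not parts:
                    raise ValueError("header without identifier at byte %d" % pos)
                key = parts[0].decode("ascii")
                start = pos
            pos += len(line)
    if key is not None:
        rows.append((key, start, pos - start))
    with con:  # commit all rows in a single transaction
        con.executemany(
            "INSERT OR REPLACE INTO offset_index VALUES (?, ?, ?)", rows
        )
    return con


def fetch_raw(con, fasta_path, key):
    """Return the raw bytes of one record using the stored offset and length."""
    row = con.execute(
        "SELECT offset, length FROM offset_index WHERE key = ?", (key,)
    ).fetchone()
    if row is None:
        raise KeyError(key)
    offset, length = row
    with open(fasta_path, "rb") as handle:
        handle.seek(offset)
        return handle.read(length)
```

Because only the key, offset and length are stored, the same three
columns could in principle be kept in a key-value store instead of
SQLite, which is the kind of backend flexibility raised above.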