[BioSQL-l] Consistency between bio* projects

mark.schreiber at group.novartis.com
Sun Jan 16 21:41:15 EST 2005


It would seem that what is needed is a mapping of each field from a file 
format to a field in a BioSQL table. I think initially this would only 
need to be done for EMBL, SwissProt and GenBank.

In many ways I prefer the idea of developing a SQL API which would be more 
robust and would serve to define what is expected of each procedure call. 
However, I think it should be achievable for the schema. In fact there is 
no reason why both cannot co-exist. For any API there should be a possible 
implementation, so naturally the schema could be used to generate an API. 
People could then happily make other schemata that fit the API, which may 
be optimised for their needs.
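
To illustrate the contract-versus-implementation idea, here is a minimal Python sketch. The SQL API itself would live in the database as procedure calls; this only shows how one contract can sit in front of interchangeable backing stores. Every name in it (BioentryStore, InMemoryStore, add_bioentry, find_by_accession) is hypothetical and not part of BioSQL, BioJava or BioPerl.

```python
from typing import Optional, Protocol


class BioentryStore(Protocol):
    """Hypothetical contract: what any backing schema must be able to do."""

    def add_bioentry(self, name: str, accession: str,
                     identifier: Optional[str]) -> int: ...

    def find_by_accession(self, accession: str) -> Optional[dict]: ...


class InMemoryStore:
    """One possible implementation; a schema-backed store would offer the same calls."""

    def __init__(self) -> None:
        self._rows = {}  # internal id -> row dict

    def add_bioentry(self, name, accession, identifier=None):
        bioentry_id = len(self._rows) + 1  # stand-in for a database sequence
        self._rows[bioentry_id] = {
            "name": name,
            "accession": accession,
            "identifier": identifier,
        }
        return bioentry_id

    def find_by_accession(self, accession):
        for row in self._rows.values():
            if row["accession"] == accession:
                return row
        return None


def load(store: BioentryStore) -> int:
    # Client code depends only on the contract, not on the storage layout.
    # The values are made up for illustration.
    return store.add_bioentry("NT_000001", "NT_000001", "12345")


store = InMemoryStore()
new_id = load(store)
```

A store optimised for a particular site could replace InMemoryStore without any change to load().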

Does anyone have a recent UML or similar diagram for the schema? I can 
then use this to suggest mappings from GenBank fields to the API. I think 
it may be easier in many cases to follow bioperl's lead. BioJava seems to 
follow the 'store everything that isn't a feature as a bioentry_qualifier' 
approach so I just need to add some special cases.

Hilmar, would you be prepared to do any work on the BioPerl side for 
synchronization of the two?

- Mark





Hilmar Lapp <hlapp at gnf.org>
01/15/2005 01:58 AM

 
        To:     Mark Schreiber/GP/Novartis at PH
        cc:     biosql-l at open-bio.org
        Subject:        Re: [BioSQL-l] Consistency between bio* projects



On Friday, January 14, 2005, at 01:10  AM, 
mark.schreiber at group.novartis.com wrote:
>  Unfortunately, Bioperl stores identifiers as
> follows:
>
> Bioentry.bioentry_id is the unique internal reference number
> Bioentry.name is the GI number

The GI number goes to Bioentry.Identifier, which was designated for 
storing the identifier within an external database.

Bioentry.name should hold the locus name, which for contigs and many 
other entries will be identical to the accession (but not the GI 
number!).

If you find it in Bioentry.name then I suspect you weren't loading from 
genbank or embl formatted input?
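
The placement described above can be written down as a small mapping table. The sketch below is an assumption-laden illustration, not the BioPerl loader: the keys naming the GenBank fields, the helper to_bioentry_row, and the sample values are all hypothetical. Only the target columns name, accession and identifier of the bioentry table come from the schema under discussion. bioentry_id is left out because it is the internal key, not something read from the file.

```python
# Hypothetical mapping from parsed GenBank fields to bioentry columns.
# The left-hand keys are illustrative labels, not a real parser's output.
GENBANK_TO_BIOENTRY = {
    "LOCUS name": "name",        # locus name, often equal to the accession
    "ACCESSION": "accession",
    "GI": "identifier",          # identifier within the external database
}


def to_bioentry_row(record: dict) -> dict:
    """Translate parsed GenBank fields into a column -> value dict."""
    row = {}
    for field, column in GENBANK_TO_BIOENTRY.items():
        if field in record:
            row[column] = record[field]
    return row


# Made-up record for illustration only.
example = {"LOCUS name": "NT_000001", "ACCESSION": "NT_000001", "GI": "12345"}
row = to_bioentry_row(example)
print(row)
```

If both projects loaded from one agreed table like this, the GI number could not end up in Bioentry.name on one side and Bioentry.Identifier on the other.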

> From memory the basic idea of BioSQL was to define a schema that bio*
> projects could both read and write from in a language independent
> manner.
> For reasons best left to the designers (mostly I think cause MySQL
> couldn't handle stored procedures) the level of interaction is right
> down at the schema level.

Right. Also, not all database drivers in all languages support stored 
procedure calls equally well. In e.g. PostgreSQL and Oracle you can 
always get around this by writing a view and putting an INSTEAD OF 
INSERT (or UPDATE) trigger on it that will then call the procedure, but 
this is clearly not even close to an option in MySQL.
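
The view-plus-trigger workaround can be sketched in a few lines. Note the swap: this uses SQLite through Python's standard sqlite3 module, only because it runs anywhere without a server, not PostgreSQL or Oracle. SQLite has no stored procedures, so the trigger body stands in for the procedure call. The table is a cut-down, hypothetical version of bioentry, and the names bioentry_api and bioentry_api_insert are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified stand-in for the real bioentry table (assumption).
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    accession   TEXT NOT NULL,
    identifier  TEXT
);

-- The view is the surface that client code writes to.
CREATE VIEW bioentry_api AS
    SELECT name, accession, identifier FROM bioentry;

-- Runs in place of an INSERT on the view. In a database with stored
-- procedures, this body would call the procedure instead.
CREATE TRIGGER bioentry_api_insert
INSTEAD OF INSERT ON bioentry_api
BEGIN
    INSERT INTO bioentry (name, accession, identifier)
    VALUES (NEW.name, NEW.accession, NEW.identifier);
END;
""")

# A client needs nothing beyond a plain INSERT, which every driver supports.
conn.execute(
    "INSERT INTO bioentry_api (name, accession, identifier) VALUES (?, ?, ?)",
    ("NT_000001", "NT_000001", "12345"),
)
rows = conn.execute(
    "SELECT name, accession, identifier FROM bioentry"
).fetchall()
print(rows)
```

The point of the pattern is that the driver never has to issue a procedure call: the write goes through ordinary SQL and the database routes it.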

It's maybe worth considering opening a dichotomy here between MySQL 
and the rest, to provide people who need it with a SQL-level API that 
both perl and java will use. People who are interested in this by 
definition will not be interested in MySQL anyway.

>  Unfortunately this means that the way data is stored
> needs to be very consistent between projects if any APIs that use
> BioSQL are to be portable. My biojava API cannot be applied to a DB
> previously set up with bioperl, which was the original idea behind
> BioSQL in the first place.
>
> Help!!

I think you're raising a great point. Indeed, such a contract hasn't 
really been written. We're probably one of few who use both perl and 
java to access a biosql database (and I'm not using biojava as the 
object model on the java side, which is why I'm not running into this 
problem). (Note as an aside that you could also write adaptors that 
transform between the SymGene and the Biojava model when storing or 
retrieving objects from/to the database.)

It'd be great if you were willing to take the lead for getting this all 
spelled out and laid down in a document?

                 -hilmar
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------






