[Biojava-l] database for biojava
Thomas Down
td2@sanger.ac.uk
Wed, 22 Nov 2000 15:19:57 +0000
Just found this languishing at the end of my INBOX -- sorry...
On Thu, Nov 16, 2000 at 05:43:45PM +1300, McCulloch, Alan wrote:
> Does anybody have any tips on the right approach to setting up a database
> on top of which would sit biojava ?
>
> The platform will be Oracle 8 and I am very keen to NOT do my
> own data model (in the same way I'm keen to not do my own api/object
> design which is why I want to use something like biojava !) - I want to
> use a standard model if possible, if there is such a thing.
>
> Can a relational data model of some sort be derived from biojava ?
It certainly should be possible to build a new relational model
based on BioJava. Our basic model (simple sequence data,
hierarchical features) is really pretty simple -- the only
problems I can see might be:
- Sparse locations -- it'll be a little bit of extra work to
store these in the relational model. I guess I'd go for
having a `span' table:
    create table location_span (
        location_id  int not null,
        min_pos      int not null,
        max_pos      int not null
    );
So each location is modeled by one or more location_span
rows. Of course, the BioJava interfaces don't actually
/require/ you to store sparse locations -- only implement
this if you're actually going to need it.
- Polymorphic features -- I guess the easiest way might be to
have a separate table for each class of Feature object you
want to store, but this means hardwiring the supported
feature classes at a fairly low level. Another approach
would be to have a table like:
    create table feature (
        id               sequence,
        sequence_id      int not null,
        parent_id        int,
        location_id      int not null,
        type             text,
        source           text,
        biojava_feature  blob
    );
so you're storing the `universal' properties of the feature,
and then serializing the whole feature object and dumping it
in the blob.
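
To make the two ideas above a bit more concrete, here is a rough Java sketch. It is not BioJava code: SpanRow, DemoFeature and FeatureStoreSketch are names invented for illustration, and it assumes the feature objects you want to store implement java.io.Serializable. It only builds the values that would go into location_span rows and into the biojava_feature column; the actual database writes are left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: none of these classes are part of BioJava.
class FeatureStoreSketch {

    /** One row destined for the location_span table. */
    public static final class SpanRow {
        public final int locationId;
        public final int minPos;
        public final int maxPos;

        public SpanRow(int locationId, int minPos, int maxPos) {
            this.locationId = locationId;
            this.minPos = minPos;
            this.maxPos = maxPos;
        }
    }

    /** Hypothetical stand-in for a feature object that happens to be Serializable. */
    public static final class DemoFeature implements Serializable {
        private static final long serialVersionUID = 1L;
        public final String type;
        public final String source;

        public DemoFeature(String type, String source) {
            this.type = type;
            this.source = source;
        }
    }

    /** Split a sparse location, given as {min, max} blocks, into one row per block. */
    public static List<SpanRow> toSpanRows(int locationId, int[][] blocks) {
        List<SpanRow> rows = new ArrayList<>();
        for (int[] block : blocks) {
            if (block.length != 2 || block[0] > block[1]) {
                throw new IllegalArgumentException("each block must be {min, max} with min <= max");
            }
            rows.add(new SpanRow(locationId, block[0], block[1]));
        }
        return rows;
    }

    /** Serialize an object into the bytes that would go in the biojava_feature column. */
    public static byte[] toBlob(Serializable feature) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(feature);
        }
        return buffer.toByteArray();
    }

    /** Rebuild the object from bytes read back out of the blob column. */
    public static Object fromBlob(byte[] blob) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(blob))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // A location with two blocks becomes two rows sharing one location_id.
        List<SpanRow> rows = toSpanRows(7, new int[][] {{100, 200}, {350, 420}});
        for (SpanRow r : rows) {
            System.out.println(r.locationId + "\t" + r.minPos + "\t" + r.maxPos);
        }
        // Round-trip a feature through its serialized form.
        DemoFeature copy = (DemoFeature) fromBlob(toBlob(new DemoFeature("exon", "demo")));
        System.out.println(copy.type + " from " + copy.source);
    }
}
```

With JDBC, the byte array could be bound with PreparedStatement.setBytes, though how well that works for large blobs depends on the driver. Bear in mind that serialized objects are tied to the class versions that wrote them, so keeping the `universal' columns alongside the blob is what keeps the table queryable.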
But before you start implementing from scratch, you might like
to take a look at what the EnsEMBL people have been doing
(http://www.ensembl.org). They've got a fairly sophisticated
model for storing genomic data in a relational model (currently
using MySQL, but I've had the main tables running on PostgreSQL,
and I know someone is working on an Oracle port). The EnsEMBL
tables are more closely geared towards one specific application
than the BioJava model is, but it might be worth looking to
see if your data will fit into this model.
I've been working on some Java interfaces for EnsEMBL -- all
experimental code at the moment. Feel free to take a look
at the following CVS modules if you're interested (in the main
BioJava repository):
  ensembl          Lightweight Java wrappers round the ensembl
                   SQL tables (largely complete for reading, maybe
                   40-50% done for writing)

  biojava-ensembl  Bridge which allows EnsEMBL databases to be
                   viewed as BioJava SequenceDBs (currently
                   pretty experimental)
Hope this helps,
Thomas.
--
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
-- Terry Pratchett