[Bioperl-l] bioperl-db

Ewan Birney birney@ebi.ac.uk
Thu, 13 Jul 2000 10:26:06 +0000 (GMT)


On Tue, 11 Jul 2000, Jason Stajich wrote:

> On Tue, 11 Jul 2000, Fernan Aguero wrote:
> 
> > I have some newbie questions about the following:
> > 
> > >bioperl-db is an effort to provide sequence database access and support
> > >for updateable sequences (and the annotations) outside of the core bioperl
> > >read-only seq db support.
> > >
> > >Currently this is being implemented using the Ensembl db structure (with
> > >some additional tables) on a mysql db server.  The ultimate goal for
> > >ensembl-lite (in my mind at least) is a reasonable framework for
> > >small/midsized laboratories to store their genomic data and access to the
> > >analysis pipelines (a lite-r version than the standard ensembl pipeline). 
> > >When the DAS standard is completed we would also like to make
> > >ensembl-lite a DAS server.
> > >
> > >AceDB support would be nice as well, to allow users to access data in the
> > >ensembl-lite system transparently whether it is in the mysql db, an acedb
> > >file/server, or a remote web db (GenBank, others ... ).  Other ideas or
> > >additions will be welcome.
> > >
> > >I am in the mid stages of the updatableseqdb implementation, and still
> > >designing the rest of the structures.  Suggestions, volunteers, and support
> > >are welcome of course.
> > 
> > 
> > We are currently trying to implement an automated framework to do 
> > analysis on sequences. Our approach uses PostgreSQL and some Perl 
> > scripts to load sequence data into the database, run the analysis on it, 
> > and store the results back in the DB.
> 
> > i) is this what ensembl does?
> 
> yes, no, maybe... The ensembl analysis pipeline works but is geared
> towards heavy duty analysis not some simple BLASTing.  Through my
> discussions with Ewan I understand that it would make more sense to have a
> lite pipeline that works for smaller things.  Ewan can certainly give a
> better description of what it does/does not do.  I'd obviously like to see
> a wealth of sequence analysis, annotation, and prediction software as part
> of the pipeline; whether or not that is really feasible will depend on the
> number of hands on deck.

Ensembl (www.ensembl.org) is a full-blown genome sequence management
database. It is designed to scale well for things like human/mouse (human
is what we run it on at the moment): large data sets, lots of changing
data, a complex analysis process, and the need to view a fragmented genome
as if it were actually linear.

Ensembl has a lot of technology and code worrying about things that are
really not appropriate for "I just want to analyse these 200 sequences".
That's why Jason has started on bioperl-db, using the design from Ensembl
(after talking things over heavily with Michele and me first).

I hope that bioperl-db and Ensembl will be good bedfellows going forward,
learning from each other as well as sharing code where appropriate.



>  
> > ii) what is the developing status of bioperl-db?
> 
> Would like to finish off Bio::EnsemblLite::UpdateableDB in the next few
> days.  I don't want to check in unfinished code just yet...

;)

> 
> But if there is definitely interest, I can finish up my design document
> and put it up for discussion (Thinking I really want a wiki for this, maybe
> we'll put it on the Ensembl wiki if that is okay, Ewan?)

Definitely. We need a bioperl wiki. Something to bring up at BOSC.

> 
> Basically, I have proposed the UpdateableSeqI for sequences that are
> 'changeable' in contrast to read-only databases.  The next step is running
> analysis on these seqs and capturing those results.  Methods for
> annotating would fall into this as well.  Some of these things are solved
> by ensembl, some are not; finding that line has been hard for me, but I
> think I am beginning to see the big picture...
> 

Jason - you should certainly copy the runnable/runnabledb system we have
in the Ensembl analysis pipeline. This will allow all of Ensembl's
pipeline components to be run in your system.
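
For anyone who has not looked at the pipeline code, the split can be
sketched roughly as below. This is a minimal Python sketch of the idea, not
the Ensembl Perl code: a runnable only performs an analysis, while a
runnabledb fetches the input, runs the runnable and writes the output back.
The class names, the GC-content analysis and the dict standing in for the
database are all hypothetical, and the fetch_input/run/write_output method
names are an assumption about the shape of the interface.

```python
from dataclasses import dataclass, field


@dataclass
class GCContentRunnable:
    """Pure analysis step: knows nothing about any database."""
    sequences: dict                      # sequence id -> sequence string
    results: dict = field(default_factory=dict)

    def run(self):
        # Compute the GC fraction of every sequence handed to us.
        for seq_id, seq in self.sequences.items():
            gc = sum(1 for base in seq.upper() if base in "GC")
            self.results[seq_id] = gc / len(seq) if seq else 0.0

    def output(self):
        return self.results


class GCContentRunnableDB:
    """Wrapper: fetches input from a store, runs the runnable, writes results."""

    def __init__(self, store, input_ids):
        self.store = store               # hypothetical dict standing in for a db
        self.input_ids = input_ids
        self.runnable = None

    def fetch_input(self):
        seqs = {i: self.store["sequences"][i] for i in self.input_ids}
        self.runnable = GCContentRunnable(seqs)

    def run(self):
        self.runnable.run()

    def write_output(self):
        self.store.setdefault("results", {}).update(self.runnable.output())


store = {"sequences": {"seq1": "GGCC", "seq2": "ATAT", "seq3": "ACGT"}}
job = GCContentRunnableDB(store, ["seq1", "seq3"])
job.fetch_input()
job.run()
job.write_output()
```

Because the analysis class never touches the store, the same runnable can
in principle be reused behind a different database wrapper, which is the
point of copying the design.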

> > iii) any reasons to choose MySQL instead of PostgreSQL? I know that 
> > the former is faster than the latter...any others?
> > We have settled on PostgreSQL due to its ability to do 
> > subqueries. Although we don't have any subqueries that we need to 
> > do right now, we thought that this capability could be useful. Any 
> > comments on this?
> 
> MySQL was chosen because that is the db the Ensembl group chose and I want
> this to match up with their work as much as possible so we can take
> advantage of their code when appropriate.  There is work underway to port
> the underlying db connection in Ensembl to a more generic framework so
> that multiple dbs can be supported more easily. I'll be working on the
> Sybase port when we agree on an object model for this; I'm sure a Postgres
> port can be included as well if someone wants to tackle it.
> 

MySQL is faster, handles large data sets just fine, and is easy to administer.

The lack of transactions and subselects is a bore, but you can get around
it.
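
As one example of getting around missing subselects, an IN (SELECT ...)
query can usually be rewritten as a join. The sketch below uses Python with
an in-memory SQLite database purely as a stand-in so that it runs anywhere
(SQLite accepts both forms, which lets the two be compared). The seq and
feature tables are hypothetical and are not part of the Ensembl or
bioperl-db schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE seq (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE feature (seq_id INTEGER, score REAL);
    INSERT INTO seq VALUES (1, 'clone_a'), (2, 'clone_b'), (3, 'clone_c');
    INSERT INTO feature VALUES (1, 95.0), (1, 90.0), (2, 10.0), (3, 88.0);
""")

# With subselects: sequences that have at least one high-scoring feature.
with_subselect = """
    SELECT name FROM seq
    WHERE id IN (SELECT seq_id FROM feature WHERE score > 80)
    ORDER BY name
"""

# Without subselects: the same question asked as a join.
# DISTINCT guards against a sequence matching several features.
with_join = """
    SELECT DISTINCT s.name FROM seq s
    JOIN feature f ON f.seq_id = s.id
    WHERE f.score > 80
    ORDER BY s.name
"""

subselect_rows = [row[0] for row in conn.execute(with_subselect)]
join_rows = [row[0] for row in conn.execute(with_join)]
```

Both lists hold the same names here; the join form is the one that works
on a server without subselect support.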

> > 
> > And a proposal:
> > My experience with Perl is limited, although I can usually get away 
> > with what I want to do. If this description fits a volunteer, we can 
> > start talking about what I can do for bioperl-db. Or maybe I can help 
> > with some other task...
> >
> 
> How about this.  I'll have a first draft of the in-progress EnsemblLite
> UpdateableDB code done by Friday.  I'll write up a design doc for what I
> see needs to be worked on, and we can see what the interest level is for
> volunteers and helpers.  People can add to the document and we'll see
> where it takes us. 
> 
> BTW: I just checked in the makefile, readme, and the sql code.  I will put
> the EnsemblLite in as soon as the 1st try at implementation is finished. 
> 
> -Jason
> 
> > 
> > Fernan
> > -- 
> > 
> > 
> > 
> > Lic. Fernan Aguero                                        Tel: (54-11) 4752-0021
> > Instituto de Investigaciones Biotecnologicas              Fax: (54-11) 4752-9639
> > Universidad Nacional de General San Martin
> > 
> 
> Jason Stajich
> Center for Human Genetics
> Duke University Medical Center
> jason@chg.mc.duke.edu
> (919)684-1806 (office)
> (919)684-2275 (fax)
> http://wwwchg.mc.duke.edu/
> http://galton.mc.duke.edu/~jason/
> 