[Bioperl-l] storing and retrieving partial sequences

Chris Mungall cjm@fruitfly.bdgp.berkeley.edu
Wed, 5 Dec 2001 18:49:16 -0800 (PST)


If you're using postgres there are a number of optimisations to consider -

you can use the native range type to store the coordinates, this is
presumably optimised for this sort of thing. you can use triggers to keep
this automatically populated from a traditional start/end column design.

you could also keep a table of overlaps, and use triggers again to make
sure this is up to date. your inserts will be slower of course.

the ultimate solution is to write your own seq_feature class in C and make
it a postgres type - i don't think this approach would find favour with
many though. kind of challenging/tedious too, depending on your
perspective.

of course none of these solutions would be transferable to a mysql db. 

then again, i'm wondering why indexed start/end columns should be so slow
for a bacterial genome. maybe there is something funny about the indexing
(it shouldn't have to do a sequential scan), or maybe postgres is just
slower than mysql? other than the speed issue, which is still open, i
think postgres could be better than mysql as an open source bioinformatics
dbms.
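
for reference, the overlap test under discussion reduces to two
comparisons: two closed intervals overlap when each one starts no later
than the other ends. a minimal sketch of that query follows, using python's
sqlite3 module purely as a stand-in for postgres or mysql. the table and
column names (gene, clone, fstart, fend) are hypothetical and not taken
from Robson's schema, and the sketch says nothing about what the postgres
planner will actually choose to do with the index.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for postgres/mysql.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene  (name TEXT PRIMARY KEY, fstart INTEGER NOT NULL, fend INTEGER NOT NULL);
CREATE TABLE clone (name TEXT PRIMARY KEY, fstart INTEGER NOT NULL, fend INTEGER NOT NULL);
-- Hypothetical composite index on the coordinate columns.
CREATE INDEX clone_start_end ON clone (fstart, fend);
""")

conn.execute("INSERT INTO gene VALUES ('geneA', 5000, 6200)")
conn.executemany("INSERT INTO clone VALUES (?, ?, ?)", [
    ("c1", 1000, 4999),  # ends one base before the gene
    ("c2", 4000, 5500),  # overlaps the left edge
    ("c3", 5200, 6000),  # lies inside the gene
    ("c4", 6200, 9000),  # shares only the gene's last base
    ("c5", 6201, 9000),  # starts one base after the gene
])


def clones_overlapping(conn, gene_name):
    """Names of clones sharing at least one base with the named gene."""
    rows = conn.execute(
        """
        SELECT c.name
        FROM clone c
        JOIN gene g ON g.name = ?
        WHERE c.fstart <= g.fend   -- clone starts before the gene ends
          AND c.fend   >= g.fstart -- clone ends after the gene starts
        ORDER BY c.fstart
        """,
        (gene_name,),
    )
    return [r[0] for r in rows]


overlapping = clones_overlapping(conn, "geneA")
print(overlapping)  # ['c2', 'c3', 'c4']
```

coordinates here are treated as inclusive at both ends, so c4 counts as
overlapping; with half-open coordinates the comparisons would become strict.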

On Tue, 4 Dec 2001, Robson Francisco de Souza wrote:

> 
> 	Hello,
> 
> 	I believe this message is a bit off topic, but maybe some ideas
> about my problem could help developing bioperl-db.
> 	I have a problem that is similar to the one described by Jonathan,
> though I'm not using bioperl-db. I built a PostgreSQL database to hold
> annotation features from a complete bacterial genome sequence. Each
> feature has its own table and in every table there is a pair of
> coordinates describing where the feature is (start, end). My problem is to
> find overlapping features, like all clones covering a certain gene. But,
> at least in PostgreSQL, when I make a search, all coordinates comparisons
> (which are done by using >=, <=, > and <) must do a sequential scan on the
> table of overlapping features. That is slow, and may become too slow!
> 	Now, do you guys think these searches could be performed faster with
> a different design or is that a problem that will probably affect any
> design (including bioperl-db: note Heikki's proposal for bioperl-db)?
> 	Sorry if that is too database specific and a bit off topic.
> 	Cheers,
> 			Robson
> 
> 
> On Tue, 4 Dec 2001, Heikki Lehvaslaiho wrote:
> 
> > Jason Eric Stajich wrote:
> > > 
> > > There is the capability for getting a subseq in the bioperl-db
> > > implementation (bioperl layer on top of mysql).  We don't currently cache
> > > anything though so each subseq requires a new db call.  However, there
> > > should be capability there to build your own Bio::Seq::CachingSeq which
> > > intercepts calls if need be.
> > > 
> > > Not sure I totally understand the scenario so not sure if this helps.
> > > 
> > > -jason
> > 
> > My guess is that what Jonathan is after is a way to store a sequence from a genomic
> > build. bioperl-db could be used to cache the retrieved sequences.
> >  
> > The bioperl-db schema would need an additional table holding the ID of the
> > subsequence, the ID of the main sequence (== biosequence.biosequence_id) and
> > the range covered within the main sequence. When the next sequence query
> > comes in, Bio::Seq::CachingSeq could use this biosubsequence table to find
> > out if some of it is already in, retrieve it (from several separate
> > sequences if needed) and calculate how much more is needed. The rest of the
> > sequence could then be retrieved from the slow main database.
> > 
> > e.g.:
> > 
> > CREATE TABLE biosubsequence (
> >   biosubsequence_id	int(10) unsigned NOT NULL
> > 			PRIMARY KEY auto_increment,
> >   biosequence_id	int(10) NOT NULL,
> >   sub_start		int(10) NOT NULL,
> >   sub_end		int(10) NOT NULL,
> >   KEY(biosequence_id),
> >   KEY(sub_start),
> >   KEY(sub_end)
> > )
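
The lookup described above (find what is already cached, then work out
what is still missing) comes down to interval subtraction. A rough sketch
in Python, assuming 1-based inclusive coordinates as in Jonathan's example,
and assuming the cached rows for one biosequence_id have already been
fetched as (sub_start, sub_end) pairs. The function name missing_ranges is
made up for illustration and is not part of bioperl-db.

```python
def missing_ranges(cached, q_start, q_end):
    """Return the parts of [q_start, q_end] not covered by any cached range.

    cached: iterable of (sub_start, sub_end) pairs, 1-based inclusive,
    possibly overlapping and in any order.
    """
    gaps = []
    pos = q_start  # first base of the query not yet known to be covered
    for s, e in sorted(cached):
        if e < pos:
            continue  # this cached range ends before the uncovered part
        if s > q_end:
            break  # sorted by start, so no later range can help either
        if s > pos:
            gaps.append((pos, s - 1))  # uncovered stretch before this range
        pos = max(pos, e + 1)
        if pos > q_end:
            break  # query fully accounted for
    if pos <= q_end:
        gaps.append((pos, q_end))  # uncovered tail after the last range
    return gaps


# Jonathan's example: bp 50001-100000 is cached.
cache = [(50001, 100000)]
print(missing_ranges(cache, 56000, 60000))  # []
print(missing_ranges(cache, 40000, 60000))  # [(40000, 50000)]
```

An empty result means the whole query can be served from the cache;
otherwise each returned pair is a stretch that still has to come from the
slow main database.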
> > 
> > 
> > If the cache database has a long lifetime it needs a method to remove
> > redundant sequences (not in yet). Alternatively one could just drop the whole
> > database and start anew, but that code needs writing, too.
> > 
> > 	-Heikki
> > 
> > 
> > > On Mon, 3 Dec 2001, Jonathan Epstein wrote:
> > > 
> > > > Hi,
> > > >
> > > > Does anyone have a good BioPerl or ACEDB way to handle storing and
> > > > retrieval of partial sequences?
> > > >
> > > > The idea is that, say, I might have bp 50001-100000 of a particular
> > > > sequence which is 500kb long.  I want to cache this local result,
> > > > since obtaining the other sequence data may be computationally very
> > > > complex and may even require manual intervention.  So, if subsequently
> > > > there is a query for bp 56000-60000 I want to retrieve the data
> > > > immediately from the local cache.  If there is a query for bp
> > > > 40000-60000 I want to retrieve the cached portion of the data, and
> > > > set in motion whatever is needed to obtain the missing data.
> > > >
> > > > For now we are starting a home-grown mySQL solution, but I really
> > > > prefer to use a solution which is BioPerl-based or at least
> > > > BioPerl-like.
> > > >
> > > > Can anyone suggest how we might hook into Bio::DB or Bio::Seq or ... ?
> > > >
> > > > Thanks,
> > > >
> > > > - Jonathan
> > > >
> > > > _______________________________________________
> > > > Bioperl-l mailing list
> > > > Bioperl-l@bioperl.org
> > > > http://bioperl.org/mailman/listinfo/bioperl-l
> > > >
> > > 
> > > --
> > > Jason Stajich
> > > Duke University
> > > jason@cgt.mc.duke.edu
> > > 
> > 
> > -- 
> > ______ _/      _/_____________________________________________________
> >       _/      _/                      http://www.ebi.ac.uk/mutations/
> >      _/  _/  _/  Heikki Lehvaslaiho          heikki@ebi.ac.uk
> >     _/_/_/_/_/  EMBL Outstation, European Bioinformatics Institute
> >    _/  _/  _/  Wellcome Trust Genome Campus, Hinxton
> >   _/  _/  _/  Cambs. CB10 1SD, United Kingdom
> >      _/      Phone: +44 (0)1223 494 644   FAX: +44 (0)1223 494 468
> > ___ _/_/_/_/_/________________________________________________________
> > 
> 
>