[Bioperl-l] storing and retrieving partial sequences

Heikki Lehvaslaiho heikki@ebi.ac.uk
Tue, 04 Dec 2001 09:47:40 +0000


Jason Eric Stajich wrote:
> 
> There is the capability for getting a subseq in the bioperl-db
> implementation (bioperl layer on top of mysql).  We don't currently cache
> anything though so each subseq requires a new db call.  However, there
> should be capability there to build your own Bio::Seq::CachingSeq which
> intercepts calls if need be.
> 
> Not sure I totally understand the scenario so not sure if this helps.
> 
> -jason

My guess is that what Jonathan is after is a way to store a sequence from a
genomic build. bioperl-db could be used to cache the retrieved sequences.
 
The bioperl-db schema would need an additional table holding the ID of the
subsequence, the ID of the main sequence (== biosequence.biosequence_id) and
the range covered within the main sequence. When the next sequence query
comes in, Bio::Seq::CachingSeq could use this biosubsequence table to find
out whether some of it is already cached, retrieve that part (from several
separate subsequences if needed) and calculate how much more is needed. The
rest of the sequence could then be retrieved from the slow main database.

e.g.:

CREATE TABLE biosubsequence (
  biosubsequence_id	int(10) unsigned NOT NULL
			auto_increment PRIMARY KEY,
  biosequence_id	int(10) NOT NULL,
  sub_start		int(10) NOT NULL,
  sub_end		int(10) NOT NULL,
  KEY(biosequence_id),
  KEY(sub_start),
  KEY(sub_end)
);
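
The lookup step described above (find what is already cached, work out what
is still missing) could be sketched roughly as follows. This is only an
illustration in Python, not bioperl-db code: the function name split_query
is made up, the ranges are assumed to be 1-based and inclusive like
sub_start/sub_end, and the cached list is assumed to come from a query such
as "SELECT sub_start, sub_end FROM biosubsequence WHERE biosequence_id = ?".

```python
def split_query(cached, q_start, q_end):
    """Split an inclusive query range into cached hits and missing gaps.

    cached -- iterable of (sub_start, sub_end) ranges already stored
              for one main sequence (hypothetical input format)
    Returns (hits, gaps), both lists of inclusive (start, end) tuples.
    """
    # Keep only ranges that overlap the query and clip them to the query window.
    clipped = sorted(
        (max(s, q_start), min(e, q_end))
        for s, e in cached
        if s <= q_end and e >= q_start
    )

    hits, gaps = [], []
    pos = q_start  # first position not yet accounted for
    for s, e in clipped:
        if s > pos:
            # Nothing cached between pos and s - 1: fetch from the main database.
            gaps.append((pos, s - 1))
        if e >= pos:
            start = max(s, pos)
            if hits and hits[-1][1] == start - 1:
                # Adjacent to or overlapping the previous hit: extend it.
                hits[-1] = (hits[-1][0], e)
            else:
                hits.append((start, e))
            pos = e + 1
    if pos <= q_end:
        gaps.append((pos, q_end))
    return hits, gaps


if __name__ == "__main__":
    # Jonathan's example: bp 50001-100000 is cached, bp 40000-60000 is requested.
    print(split_query([(50001, 100000)], 40000, 60000))
```

Only the ranges in gaps would then need to be requested from the slow main
database; the ranges in hits can be assembled from the cached subsequences.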


If the cache database has a long lifetime, it needs a method to remove
redundant sequences (not in yet). Alternatively, one could just drop the whole
database and start anew, but that code needs writing, too.
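
One possible reading of "redundant" is a cached subsequence whose range lies
entirely inside another cached subsequence of the same main sequence. A
rough Python sketch of finding those rows is below. The function name
redundant_ids and the row layout are assumptions for illustration only, and
the sketch does not detect a range that is covered only by the union of
several other ranges.

```python
def redundant_ids(rows):
    """Return ids of cached subsequences fully contained in another one.

    rows -- iterable of (biosubsequence_id, biosequence_id, sub_start, sub_end)
            tuples (hypothetical layout mirroring the biosubsequence table)
    Of exact duplicates, the row with the lowest id is kept.
    """
    by_seq = {}
    for sub_id, seq_id, start, end in rows:
        by_seq.setdefault(seq_id, []).append((start, end, sub_id))

    redundant = []
    for ranges in by_seq.values():
        # Start ascending, end descending: a containing range always sorts
        # before the ranges it contains.
        ranges.sort(key=lambda r: (r[0], -r[1], r[2]))
        max_end = None  # largest end among the ranges kept so far
        for start, end, sub_id in ranges:
            if max_end is not None and end <= max_end:
                redundant.append(sub_id)  # inside an earlier, kept range
            else:
                max_end = end
    return sorted(redundant)


if __name__ == "__main__":
    rows = [
        (1, 7, 50001, 100000),
        (2, 7, 56000, 60000),   # inside row 1
        (3, 7, 90000, 120000),  # extends past row 1, so it is kept
    ]
    print(redundant_ids(rows))
```

The returned ids could then be removed with a DELETE on biosubsequence_id,
together with whatever stored sequence data belongs to them.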

	-Heikki


> On Mon, 3 Dec 2001, Jonathan Epstein wrote:
> 
> > Hi,
> >
> > Does anyone have a good BioPerl or ACEDB way to handle storing and
> > retrieval of partial sequences?
> >
> > The idea is that, say, I might have bp 50001-100000 of a particular
> > sequence which is 500kb long.  I want to cache this local result,
> > since obtaining the other sequence data may be computationally very
> > complex and may even require manual intervention.  So, if subsequently
> > there is a query for bp 56000-60000 I want to retrieve the data
> > immediately from the local cache.  If there is a query for bp
> > 40000-60000 I want to retrieve the cached portion of the data, and
> > set in motion whatever is needed to obtain the missing data.
> >
> > For now we are starting a home-grown mySQL solution, but I really
> > prefer to use a solution which is BioPerl-based or at least
> > BioPerl-like.
> >
> > Can anyone suggest how we might hook into Bio::DB or Bio::Seq or ... ?
> >
> > Thanks,
> >
> > - Jonathan
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l@bioperl.org
> > http://bioperl.org/mailman/listinfo/bioperl-l
> >
> 
> --
> Jason Stajich
> Duke University
> jason@cgt.mc.duke.edu
> 

-- 
______ _/      _/_____________________________________________________
      _/      _/                      http://www.ebi.ac.uk/mutations/
     _/  _/  _/  Heikki Lehvaslaiho          heikki@ebi.ac.uk
    _/_/_/_/_/  EMBL Outstation, European Bioinformatics Institute
   _/  _/  _/  Wellcome Trust Genome Campus, Hinxton
  _/  _/  _/  Cambs. CB10 1SD, United Kingdom
     _/      Phone: +44 (0)1223 494 644   FAX: +44 (0)1223 494 468
___ _/_/_/_/_/________________________________________________________