[Biojava-dev] EnsemblApi use case for DNASequences

Thu May 13 12:44:09 UTC 2010

That all sounds very useful - I'll get started now :) 

> -----Original Message-----
> From: Andy Yates [mailto:andyyatz at gmail.com] On Behalf Of Andy Yates
> Sent: 13 May 2010 13:38
> To: PATERSON Trevor
> Cc: biojava-dev at lists.open-bio.org
> Subject: Re: [Biojava-dev] EnsemblApi use case for DNASequences
> 
> I have to say that from working with Ensembl for the past 2 
> years hearing this is what it does to store sequence scares 
> **** out of me; you've really hit onto the hardest part of 
> the schema there.
> 
> As you said at the end of your email the best way to 
> accomplish this is by creating a SeqeunceProxyReader which 
> can do all this logic and lets you work with the "right" 
> objects and not have to re-implement that code. Now this 
> leaves a few alternatives to how you can represent this in 
> memory. We already have a 2bit implementation (will be called 
> TwoBitSequenceReader) for storing very large pieces of 
> Sequence but that only has support for ACGT and no support 
> for gaps or Ns. This could be extended to bring in support 
> for these as features or you could materialise that sequence 
> and then push it into another Sequence object I have been 
> working with (unchecked in atmo) which lets you join 
> Sequences together. This combined with a Sequence which 
> returns Compounds of a particular type e.g. Ns for any given 
> length would let you represent massive amounts of Sequence in 
> a very small amount of space. All of these updates will be in 
> place soon but I cannot say exactly when
> 
> The other option would be to cache chunks of the DNA indexed 
> by the seq_region_id. Pushing this into a LRU cache with soft 
> references (so they'll be cleaned up when you'd run out of 
> memory) could be a good way to go.
> 
> Either way the simple way really isn't the way to go IMHO; on 
> the flip side it would get you to a prototype quicker. Of 
> course this depends on what type of code you are writing. If 
> it is prototype code then great or if it's what normally 
> happens in Bioinformatics (we claim it's a prototype but in 
> reality it's the real deal) then go with the proper solution
> 
> Andy
> 
> On 13 May 2010, at 13:21, PATERSON Trevor wrote:
> 
> > Perhaps if I describe our initial use case and how we hope 
> to address 
> > it using Ibatis and BioJava API, I can get some pointers on 
> how much 
> > of this is already supported in BioJava, how much I am 
> going to need 
> > to implement and how I would be best doing this to generate 
> useful reusable code.
> > 
> > For each genome assembly build Ensembl stores different 
> levels of DNA 
> > sequence regions, it calls these coordinate_sytems (eg 
> clones, contigs, chromosomes etc).
> > 
> > For each genome assembly there is one 'TopLevel' 
> coordinate_system (eg chromosomes).
> > And one 'SequenceLevel' coordinate_system  (eg contigs). 
> > 
> > Each sequence region in the database records its length and 
> > coordinate_system BUT ONLY DNA regions which are at 'SequenceLevel' 
> > have actual DNA Sequence recorded, all other regions must 
> have their 
> > actual sequence recovered by 'projecting' from their level 
> to DNA regions at  'SequenceLevel'.
> > 
> > 
> > so our initial use case is
> > __________________________
> > 
> > 1. retrieve  Chromosome 25  for Chicken from the database.
> > 
> > What we get back are some properties (Name, coordinateSystemID and 
> > length)
> > - and what we map this to in ibatis is an AssembledSequence 
> Object - 
> > with these properties
> > 
> > 2. fetch the sequence level assmbly details for this Chromosome.
> > 
> > We get back a table mapping from-to coordinates of the chromosome 
> > versus from to coordinates of the contigs that are at Sequence Level
> > 
> > diagramatically this looks like
> > 
> > 
> > 
> <--------------------------------------------------------------> chr25
> > <-->  <---> <----> <--> <--> <----->     <--> <--> <---->  
> <---> contigs
> >       <----->         <-->          <----> <-------> 
> > 
> > you will note that there are
> > 	- overlaps
> > 	- gaps 
> > 	- potentially mismatches ( I am ignoring these for the moment)
> > 
> > 3. to get the DNA sequence, the ensembl perl api stitches 
> together the 
> > contigs into one 'Sequence' - filling gaps with gap 
> sequences of the 
> > correct length, so it generates an ordered list of mappings between 
> > the chromosome coordinate system  and coordinates of 
> contigs and gaps
> > 
> > <-->  <--->   ---> <-->    > <----->       ->        --->  
> <---> contigs
> >           -->         <-->          <---->  -------> 
> >    nn            n         n       n                    nn      gaps
> > 
> > the perl api can then fetch the actual DNA sequence for any 
> region of 
> > the chromosome by looking up the contig regions it needs to 
> fetch the 
> > projected sequence of from this projection map.
> > 
> > Remember that chromosomes, contigs and gaps can all be very 
> long, or very short!
> > 
> > Our Java API
> > ____________
> > 
> > I have mirrored what the perl api does
> > 
> > fetching a chromosome object - which Ibatis instantiates as an 
> > AssembledSequence object, which extends BioJava DNASequence 
> Object - 
> > but obviously just has a couple of new properties set at 
> this time (length, name, coord_system).
> > 
> > fetching an Assembly Object for this Chromosome Object - 
> this contains 
> > an ordered List of Mapping Objects which contain Source (ie the 
> > Chromosome), SourceCoordinates, Target (a new DNASequence 
> Object for 
> > each contig), TargetCoordinates
> > 
> > This Assembly Object can stitch together the Mapping Projection for 
> > all or some of the Chromosome, just like the perl API, 
> creating a new 
> > ordered List of Mapping Objects where the TargetCoordinates are 
> > alterred to remove overlaps, and new GapSequence objects have been 
> > inserted. [Gaps are problematic - do I really want 
> DNASequence Objects 
> > that contain N of length x, allowing me to use the Gaps 
> just like any 
> > other DNASequence but with all the overhead that invloves, 
> or should I 
> > just omit these mappings, or do i set the Target to Null in 
> a Mapping
> > - and then I will need code to handle these wherever I use 
> sequences 
> > that contain null spacers - PERHAPS there is some 
> representation to handle Gaps generically in the BioJava API).
> > 
> > So now I am at the point of fetching actual DNA Sequence 
> for regions 
> > of interest on the Chromosome. This will invlove a look up of the 
> > stitched Mapping List for the contig regions to retieve 
> from Ensembl, and then setting the actual DNA sequence in these.
> > 
> > Hence my simplistic extension of DNA Sequences in the above 
> scenario 
> > falls over because of the Ibatis Bean requirement for setting 
> > properties directly on Objects, whivh i cant work around if 
> the DNASequence objects don't allow for setters.
> > 
> > I'm playing with lots of different ideas - possibly the simplest is 
> > just to forget about extending BioJava DNASequence for my ensembl 
> > objects (chromosomes, contigs)
> > - and just create DNASequences for the 'real' Sequences that I get 
> > back as base strings from ensembl, which would then be 
> contained or referenced in my chromosomes/contig objects etc.
> > I am sure however that this would mean that I end up having to 
> > reimplement much of the BioJava functionality in the new model 
> > Objects, whereas I was hoping to leverage this 
> transparently by simply extending DNASequence.
> > 
> > I guess one of my biggest concerns about extending BioJava to 
> > represent very big sequences is the potential overhead if i 
> have to instantiate them with backing stores containing the 'real'
> > sequences - we are obvioulsy hoping to lazy load 
> (sub)sequences from 
> > ensembl when they are actually needed. We would have to be very 
> > careful to override all the methods that called back to the 
> backing store if we already had the information we needed or 
> could lazy load it, without grabbing the whole sequence.
> > (e.g. the simple case of the chromosome - we have the 
> length from the 
> > initial query - so wouldn't want retrieve it from the 
> backing store).
> > 
> > So probably the correct way of doing things is to Implement 
> our own  
> > SequenceProxyReader for EnsemblAware Sequences to handle 
> lazy loads, 
> > which also provides all of the required backing store 
> functionality. As usual the correct way will turn out to be 
> the most work!
> > 
> > Cheers Trevor
> > --
> > The University of Edinburgh is a charitable body, registered in 
> > Scotland, with registration number SC005336.
> > 
> > 
> > _______________________________________________
> > biojava-dev mailing list
> > biojava-dev at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-dev
> 
> -- 
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
> 
> 
> 
> 
> 
-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.