[Biojava-dev] EnsemblApi use case for DNASequences
PATERSON Trevor
trevor.paterson at roslin.ed.ac.uk
Thu May 13 12:44:09 UTC 2010
That all sounds very useful - I'll get started now :)
> -----Original Message-----
> From: Andy Yates [mailto:andyyatz at gmail.com] On Behalf Of Andy Yates
> Sent: 13 May 2010 13:38
> To: PATERSON Trevor
> Cc: biojava-dev at lists.open-bio.org
> Subject: Re: [Biojava-dev] EnsemblApi use case for DNASequences
>
> I have to say that from working with Ensembl for the past 2
> years hearing this is what it does to store sequence scares
> **** out of me; you've really hit onto the hardest part of
> the schema there.
>
> As you said at the end of your email the best way to
> accomplish this is by creating a SeqeunceProxyReader which
> can do all this logic and lets you work with the "right"
> objects and not have to re-implement that code. Now this
> leaves a few alternatives to how you can represent this in
> memory. We already have a 2bit implementation (will be called
> TwoBitSequenceReader) for storing very large pieces of
> Sequence but that only has support for ACGT and no support
> for gaps or Ns. This could be extended to bring in support
> for these as features or you could materialise that sequence
> and then push it into another Sequence object I have been
> working with (unchecked in atmo) which lets you join
> Sequences together. This combined with a Sequence which
> returns Compounds of a particular type e.g. Ns for any given
> length would let you represent massive amounts of Sequence in
> a very small amount of space. All of these updates will be in
> place soon but I cannot say exactly when
>
> The other option would be to cache chunks of the DNA indexed
> by the seq_region_id. Pushing this into a LRU cache with soft
> references (so they'll be cleaned up when you'd run out of
> memory) could be a good way to go.
>
> Either way the simple way really isn't the way to go IMHO; on
> the flip side it would get you to a prototype quicker. Of
> course this depends on what type of code you are writing. If
> it is prototype code then great or if it's what normally
> happens in Bioinformatics (we claim it's a prototype but in
> reality it's the real deal) then go with the proper solution
>
> Andy
>
> On 13 May 2010, at 13:21, PATERSON Trevor wrote:
>
> > Perhaps if I describe our initial use case and how we hope
> to address
> > it using Ibatis and BioJava API, I can get some pointers on
> how much
> > of this is already supported in BioJava, how much I am
> going to need
> > to implement and how I would be best doing this to generate
> useful reusable code.
> >
> > For each genome assembly build Ensembl stores different
> levels of DNA
> > sequence regions, it calls these coordinate_sytems (eg
> clones, contigs, chromosomes etc).
> >
> > For each genome assembly there is one 'TopLevel'
> coordinate_system (eg chromosomes).
> > And one 'SequenceLevel' coordinate_system (eg contigs).
> >
> > Each sequence region in the database records its length and
> > coordinate_system BUT ONLY DNA regions which are at 'SequenceLevel'
> > have actual DNA Sequence recorded, all other regions must
> have their
> > actual sequence recovered by 'projecting' from their level
> to DNA regions at 'SequenceLevel'.
> >
> >
> > so our initial use case is
> > __________________________
> >
> > 1. retrieve Chromosome 25 for Chicken from the database.
> >
> > What we get back are some properties (Name, coordinateSystemID and
> > length)
> > - and what we map this to in ibatis is an AssembledSequence
> Object -
> > with these properties
> >
> > 2. fetch the sequence level assmbly details for this Chromosome.
> >
> > We get back a table mapping from-to coordinates of the chromosome
> > versus from to coordinates of the contigs that are at Sequence Level
> >
> > diagramatically this looks like
> >
> >
> >
> <--------------------------------------------------------------> chr25
> > <--> <---> <----> <--> <--> <-----> <--> <--> <---->
> <---> contigs
> > <-----> <--> <----> <------->
> >
> > you will note that there are
> > - overlaps
> > - gaps
> > - potentially mismatches ( I am ignoring these for the moment)
> >
> > 3. to get the DNA sequence, the ensembl perl api stitches
> together the
> > contigs into one 'Sequence' - filling gaps with gap
> sequences of the
> > correct length, so it generates an ordered list of mappings between
> > the chromosome coordinate system and coordinates of
> contigs and gaps
> >
> > <--> <---> ---> <--> > <-----> -> --->
> <---> contigs
> > --> <--> <----> ------->
> > nn n n n nn gaps
> >
> > the perl api can then fetch the actual DNA sequence for any
> region of
> > the chromosome by looking up the contig regions it needs to
> fetch the
> > projected sequence of from this projection map.
> >
> > Remember that chromosomes, contigs and gaps can all be very
> long, or very short!
> >
> > Our Java API
> > ____________
> >
> > I have mirrored what the perl api does
> >
> > fetching a chromosome object - which Ibatis instantiates as an
> > AssembledSequence object, which extends BioJava DNASequence
> Object -
> > but obviously just has a couple of new properties set at
> this time (length, name, coord_system).
> >
> > fetching an Assembly Object for this Chromosome Object -
> this contains
> > an ordered List of Mapping Objects which contain Source (ie the
> > Chromosome), SourceCoordinates, Target (a new DNASequence
> Object for
> > each contig), TargetCoordinates
> >
> > This Assembly Object can stitch together the Mapping Projection for
> > all or some of the Chromosome, just like the perl API,
> creating a new
> > ordered List of Mapping Objects where the TargetCoordinates are
> > alterred to remove overlaps, and new GapSequence objects have been
> > inserted. [Gaps are problematic - do I really want
> DNASequence Objects
> > that contain N of length x, allowing me to use the Gaps
> just like any
> > other DNASequence but with all the overhead that invloves,
> or should I
> > just omit these mappings, or do i set the Target to Null in
> a Mapping
> > - and then I will need code to handle these wherever I use
> sequences
> > that contain null spacers - PERHAPS there is some
> representation to handle Gaps generically in the BioJava API).
> >
> > So now I am at the point of fetching actual DNA Sequence
> for regions
> > of interest on the Chromosome. This will invlove a look up of the
> > stitched Mapping List for the contig regions to retieve
> from Ensembl, and then setting the actual DNA sequence in these.
> >
> > Hence my simplistic extension of DNA Sequences in the above
> scenario
> > falls over because of the Ibatis Bean requirement for setting
> > properties directly on Objects, whivh i cant work around if
> the DNASequence objects don't allow for setters.
> >
> > I'm playing with lots of different ideas - possibly the simplest is
> > just to forget about extending BioJava DNASequence for my ensembl
> > objects (chromosomes, contigs)
> > - and just create DNASequences for the 'real' Sequences that I get
> > back as base strings from ensembl, which would then be
> contained or referenced in my chromosomes/contig objects etc.
> > I am sure however that this would mean that I end up having to
> > reimplement much of the BioJava functionality in the new model
> > Objects, whereas I was hoping to leverage this
> transparently by simply extending DNASequence.
> >
> > I guess one of my biggest concerns about extending BioJava to
> > represent very big sequences is the potential overhead if i
> have to instantiate them with backing stores containing the 'real'
> > sequences - we are obvioulsy hoping to lazy load
> (sub)sequences from
> > ensembl when they are actually needed. We would have to be very
> > careful to override all the methods that called back to the
> backing store if we already had the information we needed or
> could lazy load it, without grabbing the whole sequence.
> > (e.g. the simple case of the chromosome - we have the
> length from the
> > initial query - so wouldn't want retrieve it from the
> backing store).
> >
> > So probably the correct way of doing things is to Implement
> our own
> > SequenceProxyReader for EnsemblAware Sequences to handle
> lazy loads,
> > which also provides all of the required backing store
> functionality. As usual the correct way will turn out to be
> the most work!
> >
> > Cheers Trevor
> > --
> > The University of Edinburgh is a charitable body, registered in
> > Scotland, with registration number SC005336.
> >
> >
> > _______________________________________________
> > biojava-dev mailing list
> > biojava-dev at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-dev
>
> --
> Andrew Yates Ensembl Genomes Engineer
> EMBL-EBI Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/
>
>
>
>
>
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
More information about the biojava-dev
mailing list