[Bioperl-l] Refactoring Locations...

Hilmar Lapp hlapp@gnf.org
Mon, 1 Jul 2002 12:02:44 -0700


I largely second this, the exception being that IMO some kinds of external applications will be affected. E.g., I don't think it's the same for a drawing application whether the start of a feature is 0 or 1 ...

Computed properties on locations should be unaffected though, like length, intersection, etc.

Building our integrated gene(ome) annotation database, we're concerned with exactly this issue: one of the questions that has already hit us is which coordinate system we are going to use, and how many bugs we will incur by having to translate from the coordinate system of each data source to that of bioperl, and then from bioperl to biosql ... Couldn't everyone just use the same one? Following on from that, if there's no commonly used standard now, which system is most likely to become the standard?
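
To make the translation concern concrete, here is a minimal sketch of converting between the two conventions discussed in this thread: 1-based inclusive coordinates and 0-based half-open ("space-oriented") coordinates. It is written in Python purely for illustration. The names `Closed`, `HalfOpen`, `to_half_open`, `to_closed` and the two intersection helpers are hypothetical and are not part of the bioperl or biosql API.

```python
from typing import NamedTuple, Optional


class Closed(NamedTuple):
    """1-based, inclusive location (feature-table style, e.g. 34..55)."""
    start: int
    end: int

    def length(self) -> int:
        # Inclusive coordinates need the +1.
        return self.end - self.start + 1


class HalfOpen(NamedTuple):
    """0-based, half-open location: coordinates count the spaces between residues."""
    start: int
    end: int

    def length(self) -> int:
        # Half-open coordinates: length is simply end minus start.
        return self.end - self.start


def to_half_open(loc: Closed) -> HalfOpen:
    """Shift only the start; the end value is numerically unchanged."""
    return HalfOpen(loc.start - 1, loc.end)


def to_closed(loc: HalfOpen) -> Closed:
    """Inverse conversion. A zero-length half-open location (n, n) has no
    inclusive range equivalent; in feature-table notation it is n^(n+1)."""
    if loc.end <= loc.start:
        raise ValueError("zero-length location cannot be written as start..end")
    return Closed(loc.start + 1, loc.end)


def intersect_closed(a: Closed, b: Closed) -> Optional[Closed]:
    start, end = max(a.start, b.start), min(a.end, b.end)
    return Closed(start, end) if start <= end else None


def intersect_half_open(a: HalfOpen, b: HalfOpen) -> Optional[HalfOpen]:
    start, end = max(a.start, b.start), min(a.end, b.end)
    return HalfOpen(start, end) if start < end else None


if __name__ == "__main__":
    feature = Closed(34, 55)
    same = to_half_open(feature)
    print(feature.length(), same.length())  # both describe the same residues
    print(to_closed(same) == feature)       # round trip preserves the location
```

In this sketch the computed properties (length, intersection) agree in both systems, while the raw start value differs by one. That raw value is what a drawing application would read, and every hop between data source, bioperl and biosql is one more place where the shift can be forgotten or applied twice.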

	-hilmar

> -----Original Message-----
> From: Lincoln Stein [mailto:lstein@cshl.org]
> Sent: Monday, July 01, 2002 9:45 AM
> To: Ewan Birney; Chris Mungall
> Cc: Heikki Lehvaslaiho; Bioperl
> Subject: Re: [Bioperl-l] Refactoring Locations...
> 
> 
> I'm going to defend my position, but this will be my last word on the subject
> (this isn't worth extended discussion or a flame war).
> 
> 0)  Going to space-oriented coordinates makes our code simpler, less buggy,
> and makes it easier to add new modules.
> 
> 1) If we keep the API the same, then external applications won't need to know
> we made the change.  The only apps that will break are those that broke
> encapsulation by going directly to the hash.
> 
> 2) We have to rewrite BioPerl from the ground up next year in any case in
> order to support perl 6.0.
> 
> Lincoln
> 
> 
> On Saturday 29 June 2002 07:25 am, Ewan Birney wrote:
> > On Fri, 28 Jun 2002, Chris Mungall wrote:
> > > I second this. gadfly works in space-oriented coordinates. you have to be
> > > super-rigorous in import/export but otherwise it's a much better system,
> > > it's ridiculous having to import an awkward fuzzy system for representing
> > > insertions/splice sites etc.
> > >
> > > is it really too late to have us switch to this system? I can't see how
> > > it would be done without extreme pain but I think it'd be worth it in the
> > > end. bioperl2.0?
> >
> > I say no. Really.
> >
> >
> > We have 20 years of legacy in inclusive coordinates. As much as I would
> > love to work in half open coordinates, the number of
> > bugs/misunderstandings and idiocies that will go on is too much.
> >
> >
> > In tight projects (eg Gadfly, my own Wise2 package) where everyone is 100%
> > mind synced, I think one can make the change, and it is much nicer to
> > program in. But in Bioperl, with this loose distribution of people we just
> > can't do it.
> >
> >
> > I vote STRONG no. We stick to what has been published/stored/used for the
> > last 20 years. +1 is not that hard to put in.
> >
> > > On Fri, 28 Jun 2002, Lincoln Stein wrote:
> > > > The suggested refactoring sounds correct.  I prefer IN-BETWEEN to TWEEN
> > > > or TWIXT.
> > > >
> > > > As a meta comment, life would be much easier if positions were
> > > > described (perhaps internally) as zero-based half open intervals, which
> > > > is the way that all sensible graphics code does it (I first learned the
> > > > concepts working with Apple's QuickDraw).  In half-open intervals, the
> > > > coordinates refer to the spaces between the nucleotides, rather than to
> > > > the nucleotides themselves. For the dinucleotide AG, the following
> > > > mappings hold:
> > > >
> > > > 	coordinate		sequence
> > > >
> > > > 	(0,1)			A
> > > > 	(0,2)			AG
> > > > 	(1,1)			space between A & G
> > > >
> > > > Note that in half-open intervals, the length of the sequence is always
> > > > end minus start, and that you can do coordinate arithmetic without
> > > > adding and subtracting 1's.
> > > >
> > > > Lincoln
> > > >
> > > > On Thursday 27 June 2002 12:34 pm, Heikki Lehvaslaiho wrote:
> > > > > I ran into a small problem with Bio::Locations and would like to
> > > > > slightly refactor them.
> > > > >
> > > > >  From my point of view there are three types of exact sequence
> > > > > locations which in feature table notation are: 23, 34..55 and 46^47.
> > > > > The first two are handled by Bio::Location::Simple and have
> > > > > location_type('EXACT'). The last one is lumped into
> > > > > location_type('BETWEEN') together with locations like 46^78 and
> > > > > handled by Bio::Location::Fuzzy. The source for the confusion is that
> > > > > the feature table definition allows for locations like 46^78 which I
> > > > > do not think are used anywhere. To stress, notation 46^47 is
> > > > > essential when you have clean insertions between residues.
> > > > >
> > > > >
> > > > > Currently we have Bio::LocationI which defines the interface,
> > > > > Bio::Location::Simple and two subclasses of Simple:
> > > > > Bio::Location::Fuzzy and Bio::Location::Split.
> > > > >
> > > > > What I'd like to have is to rename the current Simple into Atomic to
> > > > > be a common superclass and recreate Bio::Location::Simple so that it
> > > > > can have two values for the method location_type(): 'EXACT' and
> > > > > 'IN-BETWEEN' ('TWEEN', 'TWIXT' ?). The object will throw an error if
> > > > > location_type() is 'TWEEN' and start() and end() are both defined and
> > > > > not adjacent. The length of 'TWIXT' location is always zero. The
> > > > > default value of location_type() will be 'EXACT'.
> > > > >
> > > > >
> > > > > In practice the code changes seem to be easy to make and there might
> > > > > even be a slight speed increase: the current Simple does some things in
> > > > > a slightly convoluted way because its methods are inherited by Fuzzy and
> > > > > Split. Using Bio::Location::Simple in scripts and other modules is
> > > > > made more complicated only if you are concerned about insertions
> > > > > (you should be!). You can then test either location_type() or
> > > > > length().
> > > > >
> > > > >
> > > > > The only other place in bioperl core outside Bio::Location that I
> > > > > have found to be affected is FTHelper.pm where one more condition
> > > > > needs to be added.
> > > > >
> > > > >
> > > > > I have almost all the code changes ready for committing.
> > > > >
> > > > > Any comments?
> > > > >
> > > > > 	-Heikki
> > >
> > > _______________________________________________
> > > Bioperl-l mailing list
> > > Bioperl-l@bioperl.org
> > > http://bioperl.org/mailman/listinfo/bioperl-l
> >
> > -----------------------------------------------------------------
> > Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
> > <birney@ebi.ac.uk>.
> > -----------------------------------------------------------------
> 
> -- 
> ========================================================================
> Lincoln D. Stein                           Cold Spring Harbor Laboratory
> lstein@cshl.org			                  Cold Spring Harbor, NY
> ========================================================================