[Bioperl-l] Refactoring Locations...

Lincoln Stein lstein@cshl.org
Fri, 28 Jun 2002 11:36:25 -0400


The suggested refactoring sounds correct.  I prefer IN-BETWEEN to TWEEN or 
TWIXT.

As a meta comment, life would be much easier if positions were described 
(perhaps internally) as zero-based half-open intervals, which is the way that 
all sensible graphics code does it (I first learned the concepts working with 
Apple's QuickDraw).  In half-open intervals, the coordinates refer to the 
spaces between the nucleotides, rather than to the nucleotides themselves.  
For the dinucleotide AG, the following mappings hold:

	coordinate		sequence

	(0,1)			A
	(0,2)			AG
	(1,1)			space between A & G

Note that in half-open intervals, the length of the sequence is always end 
minus start, and that you can do coordinate arithmetic without adding and 
subtracting 1's.
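
A minimal Python sketch of the idea, not BioPerl code. Python string slicing 
happens to use zero-based half-open coordinates, so the AG table above can be 
checked directly. The helper names (residues, length, from_feature_table) are 
hypothetical and exist only for illustration; the conversion assumes 
feature-table coordinates are 1-based and inclusive.

```python
def residues(seq, start, end):
    """Residues covered by the zero-based half-open interval (start, end)."""
    return seq[start:end]


def length(start, end):
    """Length is always end minus start; no +1 or -1 corrections."""
    return end - start


def from_feature_table(start, end, in_between=False):
    """Convert 1-based inclusive feature-table coordinates to half-open.

    Hypothetical helper: 34..55 becomes (33, 55), and the insertion
    site 46^47 becomes the single boundary (46, 46).
    """
    if in_between:
        if end != start + 1:
            raise ValueError("in-between positions must be adjacent")
        return (start, start)
    return (start - 1, end)


seq = "AG"
print(residues(seq, 0, 1))        # A
print(residues(seq, 0, 2))        # AG
print(repr(residues(seq, 1, 1)))  # '' (the space between A and G)

print(from_feature_table(23, 23))                   # (22, 23)
print(from_feature_table(34, 55))                   # (33, 55)
print(from_feature_table(46, 47, in_between=True))  # (46, 46)
print(length(*from_feature_table(34, 55)))          # 22
```

Under this representation an insertion site is just a zero-length interval, 
so it needs no separate notation.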

Lincoln


On Thursday 27 June 2002 12:34 pm, Heikki Lehvaslaiho wrote:
> I ran into a small problem with Bio::Locations and would like to slightly
> refactor them.
>
>  From my point of view there are three types of exact sequence locations
> which in feature table notation are: 23, 34..55 and 46^47. The first two
> are handled by Bio::Location::Simple and have location_type('EXACT'). The
> last one is lumped into location_type('BETWEEN') together with locations
> like 46^78 and handled by Bio::Location::Fuzzy. The source for the
> confusion is that the feature table definition allows for locations like
> 46^78 which I do not think are used anywhere. To stress, notation 46^47 is
> essential when you have clean insertions between residues.
>
>
> Currently we have Bio::LocationI which defines the interface,
> Bio::Location::Simple and two subclasses of Simple: Bio::Location::Fuzzy
> and Bio::Location::Split.
>
> What I'd like to have is to rename the current Simple into Atomic to be a
> common superclass and recreate Bio::Location::Simple so that it can have
> two values for the method location_type(): 'EXACT' and  'IN-BETWEEN'
> ('TWEEN', 'TWIXT' ?). The object will throw an error if location_type() is
> 'TWEEN' and start() and end() are both defined and not adjacent. The length
> of 'TWIXT' location is always zero. The default value of location_type()
> will be 'EXACT'.
>
>
> In practice the code changes seem to be easy to make and there might even
> be a slight speed increase: the current Simple does some things in a slightly
> convoluted way because its methods are inherited by Fuzzy and Split.
> Using Bio::Location::Simple in scripts and other modules is made more
> complicated only if you are concerned about insertions (you should be!).
> You can then test either location_type() or length().
>
>
> The only other place in bioperl core outside Bio::Location that I have
> found to be affected is FTHelper.pm where one more condition needs to be
> added.
>
>
> I have almost all the code changes ready for committing.
>
> Any comments?
>
> 	-Heikki
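
The behaviour proposed above for the recreated Simple class could be sketched 
roughly as follows. This is an illustration in Python, not the actual 
Bio::Location::Simple code: the class name SimpleLocation and its constructor 
arguments are hypothetical, and coordinates are assumed to be 1-based and 
inclusive, as in the feature table.

```python
class SimpleLocation:
    """Hypothetical sketch of an exact or in-between sequence location."""

    def __init__(self, start=None, end=None, location_type="EXACT"):
        if location_type not in ("EXACT", "IN-BETWEEN"):
            raise ValueError("unknown location_type: %r" % (location_type,))
        if location_type == "EXACT" and end is None:
            # A single position such as 23 starts and ends on the same residue.
            end = start
        if (location_type == "IN-BETWEEN"
                and start is not None and end is not None
                and end - start != 1):
            # 46^47 is allowed; 46^78 is rejected.
            raise ValueError("IN-BETWEEN requires adjacent start and end")
        self.start = start
        self.end = end
        self.location_type = location_type

    def length(self):
        # An insertion site between two residues covers no residues.
        if self.location_type == "IN-BETWEEN":
            return 0
        return self.end - self.start + 1


print(SimpleLocation(23).length())                    # 1
print(SimpleLocation(34, 55).length())                # 22
print(SimpleLocation(46, 47, "IN-BETWEEN").length())  # 0
```

A caller that cares about insertions can then test either location_type or 
length(), as suggested in the quoted message.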

-- 
========================================================================
Lincoln D. Stein                           Cold Spring Harbor Laboratory
lstein@cshl.org			                  Cold Spring Harbor, NY
========================================================================