[Bioperl-l] Bio::Coordinate::Collection could DoWhatIMean better (w/ patch)

George Hartzell hartzell at alerce.com
Thu Sep 25 05:17:54 UTC 2008


Heikki Lehvaslaiho writes:
 > George,
 > 
 > This is an error from my side. Great that you have a fix already.
 > 
 > My only worry is the number of external dependencies in BioPerl. To limit 
 > these we have recoded number of functionalities into BioPerl-specific modules. 
 > Before you commit the fix, could you see if Bio::RangeI could be used or easily 
 > extended to be used instead of Set::IntSpan?
 > 
 > Thanks,
 > 
 >    -Heikki
 > 
 > On Friday 05 September 2008 21:01:35 George Hartzell wrote:
 > > Hi all,
 > >
 > > Bio::Coordinate::Collection surprised me a bit.  At first I thought
 > > there was a bug, but it's clearly doing what it's supposed to.  Now
 > > I'm wondering if what it's supposed to be doing makes sense in some
 > > context, or if what I expected would be better functionality.
 > >
 > > t/CoordinateMapper.t sets up the following scenario:
 > >
 > > #
 > > # Collection
 > > #
 > > #         1   5     6   10
 > > #         |---|     |---|
 > > #-----|-----------------------
 > > #     1   5   9     15  19
 > > #         pair1     pair2
 > >
 > > Then goes on to do the following query:
 > >
 > >   # match more than two
 > >   $pos = Bio::Location::Simple->new (-start => 5, -end => 19 );
 > >   ok $res = $transcribe->map($pos);
 > >   is $res->each_gap, 2;
 > >   is $res->each_match, 2;
 > >
 > > I was surprised to see that there were two gaps, one gene:10-19 and
 > > one from gene:5-14.  Looking at the code, what's really happening is
 > > that, for the exon1 mapper there's match with gene:5-9 and a gap with
 > > gene:10-19 and for the exon2 mapper there's a gap with gene:5-14 and a
 > > match with gene:15-19.  All four Result's just get tossed into the
 > > return value.
 > >
 > > The result my intuition wants is that there are two matches (gene:5-9
 > > with exon1 and gene:15-19 with exon2) and a gap (gene:10-14).
 > >
 > > Yes, I guess that I could just synthesize these myself from the result
 > > in my app.
 > >
 > > It still seems that the current result is a bug though, since there's
 > > no way of knowing when you're walking through $res->each_Location that
 > > the first "gap" is with respect to the exon1 mapper and that the
 > > second "gap" is with respect to the exon2 mapper.  The gaps are
 > > meaningless.
 > >
 > > I "fixed" it to work the way I think it should (two matches, one
 > > gap).  I actually extended the test case a bit so that there's a
 > > multi-base gap, a match, another multibase-gap, another match, then a
 > > single base gap (just to make sure I got that right...).  I had to
 > > touch up the test file a bit to account for my new test.
 > >
 > > The gaps that I return have a strand of 'undef', which seems to be The
 > > Right Thing.  There's also a bit of funny business where I hang onto
 > > the seq_id of the gapped sequence.  It assumes that the "in" sequence
 > > is the same for all of the mappers.  This seems safe since otherwise
 > > the entire query is kind of weird....
 > >
 > > There's a patch to todays svn head at:
 > >
 > >   http://shrimp.alerce.com/bioperl/collection-diffs.txt
 > >
 > > The patch changes Build.PL to include a dependency on Set::IntSpan,
 > > CoordinateMapper.t to update the tests, and
 > > Bio/CoordinateMapper/Collection.pm for the new code.
 > >
 > > Who's code would this break.
 > >
 > > If anyone's relying on the current behaviour re: gaps, what's the
 > > situation in which you find it useful?
 > >
 > > Thanks!
 > >
 > > g.
 > [...]

So, I'm out of round 'tuit's trying to think of an _easy_ way to do
this using Range/RangeI.  I might just not be thinking clearly, and
I've definitely trained myself to solve these kinds of problems with
Set::IntSpan problems so I may just be waving my favorite hammer
around.

There may also be a BioPerl native way to do it using some fancier
object than a RangeI (split locations?).

It's easy to get the set of gaps for each pair, but is there already
something that takes the intersection of a set of non-intersecting
Range's?

g.



More information about the Bioperl-l mailing list