[Bioperl-l] location binner object

Chris Mungall cjm@fruitfly.org
Wed, 8 May 2002 09:45:14 -0700 (PDT)


I have an IntersectionGraph module in gadfly I was going to port to
bioperl. It doesn't use binning, it could probably be speeded up with a
better algorithm but it's generally sufficient for what I use it for
(usually getting intersection/distance of genes/annotations from some kind
of computational analysis)

My use case is generally finding the overlap of 2 sets (or of all the
elements in one set), so I use 2 sliding windows over the ordered lists of
features. If you're just testing 1 feature vs many this wouldn't be a
great way doing it.

One nice feature is the ability to recurse down eg to get the intersection
at the exon level rather than the gene boundary level.

If one were doing this sort of thing a lot, it would make sense to
index/bin all the features eg some kind of dual R/B tree. But for one-offs
intersection checks, this doesn't make sense, unless you maintain your
entire featureset as this kind of datastructure throughout their
lifecycle. For Jason's use case it seems like it would make sense to do a
one-off index, then do all the tests.

This would be quite a cool way of doing things for some far-in-the-future
bioperl release; right now everything is always maintained as lists, in a
top down hierarchical manner, which is great 99% of the time. but if this
implementation issue could be seperated from the interface,

eg

$bioperl = BioPerl->new
$bioperl->set_FeatureIndex("Rtree")

which would cause all future $sf->add_SubFeature and so forth to track the
feature composition graph in a centralized R tree

one could even switch the R-tree for a DBMS back end (or BDB) and have a
totally transparent persistent mechanism, much like genquire

this would also be good for some kind of listener framework, like biojava
(but with less classes)

anyway, enough rambling

the gadfly version is here if anyone else fancies porting...
http://www.fruitfly.org/developers/src/software/perl-modules/BioModel/IntersectionGraph.pm

..but you're probably better off doing it from scratch

if i get round to doing a bioperl version I'd like to make it use
Inline::C - what's the bioperl policy on Inline::C?

any suggestions for better ways of doing this welcome, i'm not really an
algorithms person


On Wed, 8 May 2002, Jason Stajich wrote:

> I'm generating a bunch of Bio::LocationI objects and would like to test if
> some are within some specified distance away from each other.  I want to
> be able to do fast lookups to see if a location is within some range.
> This seems pretty similar to Lincoln's binning in GFF for locations, but
> would it be possible to do this in-memory/BDB file as I am generating a
> lot of these and don't need to keep them once I've processed and
> identified the best choices?
>
> Essentially I want to be able to take a location from list A and see if
> any locations in list B fall in the range of X bp downstream of A.
> Plenty of implementation options, probably the fastes and easiest to
> implement would be a single vector of length of the full range covered by
> all the locations and have ptrs to the objects in the slots where the
> location overlaps, but this is a memory hog.  I could map to a BDB file
> and just eat the disk space since this is essentially generating a
> tempfile
>
> Anyways, Lincoln do you have any input here - is it just going to be
> easier to slap everything into an sql backend with DB:GFF rather than
> reinventing the wheel or can I basically just run everything through the
> Bio::DB::GFF binning but store in a BDB file?  I'm happy to adapt this to
> some sort of location aggregator/binner object for everyone's use.
>
> Thanks.
>
> -jason
>