[Bioperl-l] feature holder for testing overlaps, etc

Lincoln Stein lstein@cshl.org
Mon, 20 May 2002 17:19:15 -0400


I like Bio::SeqFeature::Collection just fine.  Unfortunately the equivalent 
in the Bio::Graphics module is named Bio::Graphics::FeatureFile, because it 
started out life as a parser for files containing a list of features, and 
then morphed into a generalized collection of features.  Probably time to 
change the name.

Lincoln

On Monday 20 May 2002 16:13, Jason Stajich wrote:
> On Mon, 20 May 2002, Lincoln Stein wrote:
> > Hi Jason,
> >
> > Would it be OK to overlay the DasI interface on top of
> > features_in_range() and get_features()?  Then gbrowse will run on top of
> > it.
>
> That sounds like a great idea.  I'll look at the interface and see what it
> would take to implement it.  Is Bio::SeqFeature::Collection an okay name
> in everyone's mind?
>
> > What if I want to combine those two methods to return features of a
> > particular type that fall inside a particular range?  This is a very
> > common optimization and will greatly help performance if implemented
> > correctly.  The DasI overlapping_features() method works this way.  There
> > are also the following methods:
>
> I was thinking about something like this just this morning -
> perhaps gbrowse could allow a set of features (and their
> associated sequences) to be selected based on a feature range and/or some
> feature metadata like:
>
> $f->has_tag('gene') && grep { /$GENE/ } $f->each_tag_value('gene')
>
> Let's tackle this after I get the range query working.
>
> > 	contained_features()  -- find features that are contained inside range
> > 	contained_in()        -- find features that completely contain a range
> >
> > The way to fetch a range with a B-Tree is to use the DB_File
> > object-oriented seq() method with a cursor of R_CURSOR.  This has to be
> > coupled with a custom indexer that performs a numeric comparison, and the
> > appropriate flags to allow you to fetch duplicate keys.  See the DB_File
> > documentation for examples of this.
>
> Thanks Lincoln.
>
> I almost have it working. I just figured out you probably can't mix
> get_dup calls within the calls to the cursor iterator or else you'll only
> get keys which have >1 value.  I'll commit the code and tests tonight if
> it all works and we can expand from there.
>
> -jason
>
> > Lincoln
> >
> > On Wednesday 15 May 2002 18:45, Jason Stajich wrote:
> > > Here is the proposal for an in-memory SeqFeature collection interface
> > > > and object tentatively called Bio::SeqFeature::FeatureCollectionI and
> > > > Bio::SeqFeature::Collection - which is analogous to ChrisM's described
> > > > IntersectionGraph (maybe it can inherit from an InterfaceGraphI if
> > > you want to help abstract those methods out).
> > >
> > > SeqFeatureCollectionI interface
> > > methods:
> > > add_features    -- add a set of features to the collection
> > >
> > > features_in_range -- returns a list of features that are contained in
> > > 		     a specified start & end,range or LocationI.
> > > 		     Optionally taking into account strand in the same
> > > 		     way the Range overlap/contains methods do.
> > > 		     Accept a flag as to whether to test for features
> > > 		     that overlap or are completely contained.
> > > > get_features(-tag => $tag) - returns a list of features that have the
> > > > 		     requested tag (this will only be more efficient
> > > > 		     than grepping on the list if the # of features is
> > > > 		     large).
> > >
> > > It could be reasonable to let Bio::Seq objects use a
> > > SeqFeatureCollection to hold their features depending on the
> > > efficiency here - but one thing at a time.
> > >
> > > > Bio::SeqFeature::Collection would be implemented with a BDB B-Tree and
> > > use Lincoln's bin method from Bio::DB::GFF::Util::Binning.  I'm not
> > > sure how to get things that fall within a range from the BDB B-Tree
> > > interface - have to pull from a sorted list somehow and most of the
> > > examples are for duplicate hash keys, hints appreciated.
> > >
> > > -jason
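
For reference, the B-Tree range fetch described in this thread (seq() with
R_CURSOR, a numeric comparison, and duplicate keys enabled) might look
roughly like the sketch below.  It is only an illustration, not the code
that went into Bio::SeqFeature::Collection: the key layout (feature start
as key, feature name as value), the sample data, and the helper name
names_starting_in() are all assumptions made up for the example.  It uses
an in-memory DB_File B-Tree and walks the cursor with R_NEXT only, without
mixing in get_dup calls.

```perl
use strict;
use warnings;
use Fcntl;      # O_RDWR, O_CREAT
use DB_File;    # DB_File::BTREEINFO, R_DUP, R_CURSOR, R_NEXT

# Private B-Tree settings (leaves the shared $DB_BTREE untouched).
my $btree = DB_File::BTREEINFO->new();
# Compare keys numerically so that 40 < 100 < 250 < 900.
$btree->{compare} = sub { $_[0] <=> $_[1] };
# Allow several values to be stored under the same key.
$btree->{flags} = R_DUP;

# An undef filename asks DB_File for an in-memory database.
my %h;
my $db = tie %h, 'DB_File', undef, O_RDWR | O_CREAT, 0666, $btree
    or die "Cannot create B-Tree: $!";

# Hypothetical data: key = feature start, value = feature name.
for my $rec ([100, 'exon1'], [250, 'exon2'], [250, 'repeatA'],
             [900, 'exon3'], [40, 'promoter']) {
    my ($start, $name) = @$rec;
    $db->put($start, $name);
}

# Hypothetical helper: names of all features whose start lies in
# [$min, $max].  R_CURSOR positions the cursor at the first key >= $min;
# R_NEXT then steps through the following records, duplicates included.
sub names_starting_in {
    my ($db, $min, $max) = @_;
    my @hits;
    my ($key, $value) = ($min, '');
    for (my $st = $db->seq($key, $value, R_CURSOR);
         $st == 0 && $key <= $max;
         $st = $db->seq($key, $value, R_NEXT)) {
        push @hits, $value;
    }
    return @hits;
}

print join(',', names_starting_in($db, 100, 500)), "\n";
```

Note that this only finds features by their start coordinate.  Testing for
overlap or containment still needs the end coordinate as well, which is
where a binning scheme such as the one in Bio::DB::GFF::Util::Binning would
come in.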