[Bioperl-l] Bio::DB::GFF::Util::Binning

Lincoln Stein lincoln.stein at gmail.com
Tue Oct 31 23:18:07 UTC 2006


Hi Keith,

The current Bio/DB/GFF/Util/Binning.pm file just contains the hierarchical
binning system that I implemented some time ago. Where is the R-tree system
that you describe? How much of an improvement did the R-tree scheme give
over the hierarchical scheme?

FYI, the GFF3 implementation uses a different binning scheme in which the
bins have a fixed size. Each bin that a feature overlaps gets its own row in
a table, so big features will have multiple rows and little features
that fit inside a bin will have only one row. The query for this is simpler
and seems to give the same relative speedup as the hierarchical binning
system. I'd really like to get these queries to go as fast as possible and
would love to work with you on this if you're interested.
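
A minimal sketch of the fixed-size bin idea described above, not the actual
GFF3 adaptor code. The table name feature_bin, its columns, and the 100 kb
BIN_SIZE are all assumptions made for illustration; the real schema and bin
width are not given in this message. It uses Python's sqlite3 module only so
that the SQL can be run as-is.

```python
import sqlite3

BIN_SIZE = 100_000  # hypothetical bin width in bp (assumption, not from the source)

def bins_for(start, end, bin_size=BIN_SIZE):
    """Return the index of every fixed-size bin that [start, end] touches."""
    return range(start // bin_size, end // bin_size + 1)

def build(features):
    """Load (id, seqid, start, end) tuples, writing one row per overlapped bin."""
    db = sqlite3.connect(":memory:")
    # "stop" is used instead of "end" because END is an SQL keyword.
    db.execute("CREATE TABLE feature_bin "
               "(id INTEGER, seqid TEXT, bin INTEGER, start INTEGER, stop INTEGER)")
    db.execute("CREATE INDEX idx_bin ON feature_bin (seqid, bin)")
    for fid, seqid, start, end in features:
        db.executemany(
            "INSERT INTO feature_bin VALUES (?, ?, ?, ?, ?)",
            [(fid, seqid, b, start, end) for b in bins_for(start, end)])
    return db

def overlapping(db, seqid, start, end):
    """Return sorted ids of features overlapping [start, end] on seqid."""
    lo, hi = start // BIN_SIZE, end // BIN_SIZE
    rows = db.execute(
        "SELECT DISTINCT id FROM feature_bin "
        "WHERE seqid = ? AND bin BETWEEN ? AND ? "
        "AND start <= ? AND stop >= ? ORDER BY id",
        (seqid, lo, hi, end, start))
    return [r[0] for r in rows]
```

The DISTINCT is needed because a big feature that spans several of the
queried bins would otherwise be returned once per bin.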

Lincoln

On 10/19/06, Keith Player <keithplayer at hotmail.com> wrote:
>
> I know that there may be some changes resulting from new GFF3
> implementations, but thought I would see if the following is useful
> anyway.
>
> I implemented the R-tree binning schema as used by
> Bio::DB::GFF::Util::Binning and as mentioned in this article:
>
> I tested the following query on a normal table (no binning), but it
> assumes that you know the longest range in the table. For example, in a
> table of human genes, the longest gene we know of is around 2.4Mb.
>
> SELECT COUNT(*) AS count FROM groups g
> WHERE g.start > max(0,[start-2.4Mb]) AND g.start < [end]
> AND g.end > [start] AND g.chromosome = '1'
>
> So for the region 100Mb:101Mb:
>
> SELECT COUNT(*) AS count FROM groups g
> WHERE g.start > 97600000 AND g.start < 101000000
> AND g.end > 100000000 AND g.chromosome = '1'
>
>
> where [start] and [end] define the region of interest. This query
> outperforms the R-tree implementation on all tests that I have performed
> (for lengths of 200bp to 10Mb across a whole chromosome). Could this be
> of some practical use?
>
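
The bounded-start query quoted above can be sketched as follows. This is an
illustration under stated assumptions, not code from either implementation:
the table is named genes here as a stand-in for the quoted groups table, the
end column is called stop because END is an SQL keyword, and the index on
(chromosome, start) is assumed. Coordinates are assumed to be 1-based, so
the strict "start > lower" comparison from the quoted query does not drop
any feature when the lower bound is clamped to 0.

```python
import sqlite3

def make_db(rows):
    """Build an in-memory table from (chromosome, start, stop) tuples."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE genes (chromosome TEXT, start INTEGER, stop INTEGER)")
    # Assumed index: bounding start on both sides allows a range scan on it.
    db.execute("CREATE INDEX idx_chr_start ON genes (chromosome, start)")
    db.executemany("INSERT INTO genes VALUES (?, ?, ?)", rows)
    return db

def count_overlaps(db, chromosome, q_start, q_end, max_len):
    """Count features overlapping [q_start, q_end].

    max_len must be at least the length of the longest feature in the
    table; a feature starting before q_start - max_len cannot reach q_start.
    """
    lower = max(0, q_start - max_len)
    row = db.execute(
        "SELECT COUNT(*) FROM genes g "
        "WHERE g.chromosome = ? AND g.start > ? AND g.start < ? AND g.stop > ?",
        (chromosome, lower, q_end, q_start)).fetchone()
    return row[0]
```

For the 100Mb:101Mb example with a 2.4Mb longest gene, the call would be
count_overlaps(db, '1', 100_000_000, 101_000_000, 2_400_000), which gives
the same lower bound of 97600000 as in the quoted query. The result is only
correct while max_len stays at or above the true longest feature, so it
would need updating whenever a longer feature is loaded.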



-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
(516) 367-8380 (voice)
(516) 367-8389 (fax)
FOR URGENT MESSAGES & SCHEDULING,
PLEASE CONTACT MY ASSISTANT,
SANDRA MICHELSEN, AT michelse at cshl.edu
