[Bioperl-l] Initial benchmarking of Bio::DB::GFF3

Thu Mar 9 20:35:34 EST 2006

Hi All,

I've completed some early benchmarking on the latest iteration of the 
Bio::DB::GFF3 module. Two things distinguish this module from the original 
Bio::DB::GFF, in addition to its ability to correctly handle the multiple 
levels of containment in GFF3. First, while there are relational tables for 
the feature location, name, attributes and type that are used for querying, 
the feature itself and all its subparts are instantiated as one 
Bio::SeqFeatureI object at load time, then serialized (using Storable or 
Data::Dumper) and stored in a relational table as a BLOB. Second, the 
"binning" scheme now uses integers rather than floats; this will avoid the 
precision problems that have plagued users of different MySQL versions.
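
As a rough illustration of why integer bins sidestep floating-point 
precision issues, here is a minimal Python sketch of a generic hierarchical 
binning function. It is not the module's actual scheme: the tier sizes and the 
name integer_bin are assumptions made up for this example.

```python
# Hypothetical tier sizes; the real module's bin sizes are not given in this post.
BIN_SIZES = [1_000, 10_000, 100_000, 1_000_000, 10_000_000, 100_000_000]


def integer_bin(start, end):
    """Return (tier, index) of the smallest bin that fully contains [start, end].

    Both values are plain integers, so equality tests on them in SQL do not
    depend on how a particular database version rounds floats.
    """
    for tier, size in enumerate(BIN_SIZES):
        if start // size == end // size:
            return tier, start // size
    # Feature spans more than the largest tier: use a single catch-all bin.
    return len(BIN_SIZES), 0


print(integer_bin(1500, 1800))
print(integer_bin(900, 1100))
```

A range query would then look up the candidate bins at each tier that 
overlap the requested region and compare them by exact integer match.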

Storing whole serialized objects means that it will take longer to load the 
database, but less time to retrieve objects, because all the Bio::SeqFeature 
object creation is done up front. It also means that there are fewer objects 
in the database, because a gene, its transcripts, and its exons are stored as 
a single object rather than as multiple objects that need to be aggregated 
together at fetch time.
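
The storage idea can be sketched in a few lines of Python. This is only an 
analogy, not the Perl implementation: it assumes sqlite3 in place of MySQL, 
pickle in place of Storable, and a hypothetical two-table layout (feature, 
name) that stands in for the real schema.

```python
import pickle
import sqlite3


def make_db():
    """Create an in-memory database with one BLOB table and one index table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE feature (id INTEGER PRIMARY KEY, object BLOB)")
    db.execute("CREATE TABLE name (feature_id INTEGER, name TEXT)")
    return db


def store(db, name, feature):
    """Serialize the whole feature tree once, at load time."""
    cur = db.execute(
        "INSERT INTO feature (object) VALUES (?)", (pickle.dumps(feature),)
    )
    db.execute(
        "INSERT INTO name (feature_id, name) VALUES (?, ?)", (cur.lastrowid, name)
    )
    return cur.lastrowid


def fetch_by_name(db, name):
    """Query the index table, then deserialize the single stored object."""
    row = db.execute(
        "SELECT f.object FROM feature f JOIN name n ON n.feature_id = f.id "
        "WHERE n.name = ?",
        (name,),
    ).fetchone()
    return pickle.loads(row[0]) if row else None


# A made-up gene with one transcript and two exons, stored as one object.
gene = {
    "type": "gene",
    "start": 100,
    "end": 900,
    "transcripts": [{"type": "mRNA", "exons": [(100, 300), (600, 900)]}],
}
db = make_db()
store(db, "example-gene", gene)
```

Fetching "example-gene" returns the gene with its transcript and exons 
already attached, with no aggregation step at fetch time.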

Here are the benchmarking results:

DATA SET: 2,849 genes (along with associated data) from 
  C. elegans chromosome I

LOAD TESTS:
 Bio::DB::GFF  (bp_bulk_load_gff.pl):  54.58s, 13M database
 Bio::DB::GFF3 (perl DBI loading):    245.06s, 11M database

RETRIEVE TESTS: (fetch 1000 random genes)
 Bio::DB::GFF:  16.81s
 Bio::DB::GFF3: 1.99s

So there's about an 8.4x speedup in retrieval, but a 4.5x slowdown in 
loading, which is pretty much what I expected. Unexpectedly, the storage size 
for the data is actually smaller for the Bio::DB::GFF3 database than for 
Bio::DB::GFF.
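
The ratios follow directly from the timings above. The short Python check 
below also estimates a break-even point, under the assumption (not tested in 
this benchmark) that fetch time scales linearly with the number of genes 
fetched.

```python
# Timings in seconds, copied from the results above.
load_gff, load_gff3 = 54.58, 245.06
fetch_gff, fetch_gff3 = 16.81, 1.99  # per 1000 random genes

retrieval_speedup = fetch_gff / fetch_gff3
load_slowdown = load_gff3 / load_gff

# Assumption: per-gene fetch cost is constant, so savings grow linearly.
extra_load_time = load_gff3 - load_gff
saving_per_fetch = (fetch_gff - fetch_gff3) / 1000
break_even_fetches = extra_load_time / saving_per_fetch

print(f"retrieval speedup: {retrieval_speedup:.1f}x")
print(f"load slowdown:     {load_slowdown:.1f}x")
print(f"break-even after about {break_even_fetches:.0f} gene fetches")
```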

This looks pretty good to me. My plan now is to experiment with a variation of 
the scheme in which each subfeature is stored as a separate BioPerl object 
and then loaded in a lazy fashion as needed. This will mean that there will 
be as many as three database fetches to get a full gene, but it also allows 
one to ignore genes and just do queries for exons, UTRs, etc. Things that 
have split locations -- such as alignments -- will continue to be stored as a 
single object, however.
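
The lazy-loading variation might look something like the following Python 
sketch. It is an illustration only: the class names LazyFeature and 
DictStore and the child_ids method are hypothetical, and a dictionary stands 
in for the database.

```python
class DictStore:
    """Stand-in for the database; counts how many lookups are made."""

    def __init__(self, parent_to_children):
        self.parent_to_children = parent_to_children
        self.fetches = 0

    def child_ids(self, feature_id):
        self.fetches += 1
        return self.parent_to_children.get(feature_id, [])


class LazyFeature:
    """A feature whose subfeatures are loaded only on first access."""

    def __init__(self, feature_id, store):
        self.feature_id = feature_id
        self._store = store
        self._children = None  # not loaded yet

    @property
    def children(self):
        if self._children is None:
            ids = self._store.child_ids(self.feature_id)
            self._children = [LazyFeature(i, self._store) for i in ids]
        return self._children


# Made-up identifiers: one gene, one transcript, two exons.
store = DictStore({"gene1": ["mrna1"], "mrna1": ["exon1", "exon2"]})
gene = LazyFeature("gene1", store)
```

No lookup happens until gene.children is read, and each level of the 
hierarchy costs one further lookup the first time it is touched. An exon can 
also be constructed directly, without going through its gene.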

Right now I'm still adjusting the names of the various modules, so they are 
in my private CVS. I'll move everything to bioperl-live as soon as the names 
stabilize.

Lincoln

-- 
Lincoln Stein
lstein at cshl.edu
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
(516) 367-8380 (voice)
(516) 367-8389 (fax)