[BioSQL-l] bioperl-db performance

Thu Nov 13 03:54:40 EST 2003

Hi,

I have seen other mails concerning a relatively low throughput of
sequences during storage (with load_seqdatabase.pl). I see the same
problem with the latest bioperl-db, bioperl 1.2.3, perl 5.8.1 and
RedHat 9 (with a newly compiled perl to avoid the UTF-8 problems in RH9).
We have tested various RDBMSs: MySQL 3.23.54a, MySQL 4.0.16 and
Oracle 9.2.0.4, on different machines with 1-2 CPUs (2.5 GHz P4),
1-2 GB of memory and lots of disk space. But no matter what, the
throughput is about 5 sequences per second. If I understand the
benchmarks correctly, the expected throughput is about 60 seqs per
second on a computer half as fast. If I start several jobs on separate
machines to upload sequences to a common db (MySQL 4.0.16), the
throughput scales perfectly, so the RDBMS is not the bottleneck.

I did the same test with biojava 1.3 + MySQL 3.23.54a (but with an older version
of BioSQL that matches biojava) and there the throughput matches the 
benchmark (about 50 seqs per second).

If I profile the loading of a 10-sequence GenBank file with
perl -d:DProf load_seqdatabase.pl ...
the output from dprofpp tmon.out is:

%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 50.2   1.084  1.084 150388   0.0000 0.0000  overload::mycan
 12.2   0.264  1.411   4208   0.0001 0.0003  Carp::caller_info
 4.22   0.091  1.502    210   0.0004 0.0072  Carp::ret_backtrace
 3.85   0.083  1.624   3600   0.0000 0.0005  Bio::DB::BioSQL::BasePersistenceAdaptor::_create_persistent
 3.24   0.070  0.147   6174   0.0000 0.0000  Bio::DB::Persistent::PersistentObject::AUTOLOAD
 3.10   0.067  1.134   5643   0.0000 0.0002  overload::StrVal
 3.10   0.067  0.067  11722   0.0000 0.0000  UNIVERSAL::isa
 2.41   0.052  0.086   1081   0.0000 0.0001  Bio::DB::Persistent::PersistentObject::can
 2.22   0.048  0.661    151   0.0003 0.0044  Bio::Root::Root::_load_module
 1.71   0.037  0.083    140   0.0003 0.0006  Bio::DB::BioSQL::SimpleValueAdaptor::add_association
 1.67   0.036  0.036   1845   0.0000 0.0000  Bio::Root::RootI::_rearrange
 1.25   0.027  1.609   3191   0.0000 0.0005  Bio::DB::BioSQL::BasePersistenceAdaptor::_process_child
 1.20   0.026  1.772   1669   0.0000 0.0011  Bio::DB::BioSQL::BasePersistenceAdaptor::create_persistent
 1.20   0.026  0.026   2118   0.0000 0.0000  UNIVERSAL::can
 1.16   0.025  0.033   1422   0.0000 0.0000  Bio::Root::Root::new
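
To make the listing easier to read, the %Time column can be grouped by
top-level package. This is only a rough sketch in Python; it is not part
of bioperl-db or dprofpp. The function name share_by_package and the
hard-coded ROWS list are made up for this illustration. The numbers are
copied from the listing above, with the wrapped subroutine names rejoined.

```python
from collections import defaultdict

# (subroutine name, %Time) pairs copied by hand from the dprofpp listing.
# Only the 15 rows shown in the listing are included.
ROWS = [
    ("overload::mycan", 50.2),
    ("Carp::caller_info", 12.2),
    ("Carp::ret_backtrace", 4.22),
    ("Bio::DB::BioSQL::BasePersistenceAdaptor::_create_persistent", 3.85),
    ("Bio::DB::Persistent::PersistentObject::AUTOLOAD", 3.24),
    ("overload::StrVal", 3.10),
    ("UNIVERSAL::isa", 3.10),
    ("Bio::DB::Persistent::PersistentObject::can", 2.41),
    ("Bio::Root::Root::_load_module", 2.22),
    ("Bio::DB::BioSQL::SimpleValueAdaptor::add_association", 1.71),
    ("Bio::Root::RootI::_rearrange", 1.67),
    ("Bio::DB::BioSQL::BasePersistenceAdaptor::_process_child", 1.25),
    ("Bio::DB::BioSQL::BasePersistenceAdaptor::create_persistent", 1.20),
    ("UNIVERSAL::can", 1.20),
    ("Bio::Root::Root::new", 1.16),
]


def share_by_package(rows):
    """Sum %Time per top-level Perl package (the text before the first '::')."""
    totals = defaultdict(float)
    for name, pct in rows:
        top_level = name.split("::", 1)[0]
        totals[top_level] += pct
    return dict(totals)


if __name__ == "__main__":
    # Print packages from largest to smallest share of exclusive time.
    ranked = sorted(share_by_package(ROWS).items(), key=lambda kv: -kv[1])
    for package, pct in ranked:
        print(f"{package:<10} {pct:6.2f}")
```

On the rows shown, the overload and Carp entries together come to
roughly 70% of the listed %Time, and the Bio:: entries to about 19%.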

It seems like a lot of time is spent on creating objects. Is my system
wrongly configured, or am I doing something else wrong?

Regards, Dennis
================================
Dennis Madsen, Ph.D.
Scientific Computing, Bioinformatics Group
Novo Nordisk Park, A2P
2760 Måløv
Denmark
================================