[Bioperl-l] bioperl-db performance: load_seqdatabase.pl throughput speed

Hilmar Lapp hlapp at gnf.org
Tue May 11 18:38:30 EDT 2004


With so little memory you'll very quickly run into memory 
contention issues. I'm not exactly sure about the memory caching model 
of mysql, but AFAIR it uses its own memory pool (like Oracle does) 
rather than deferring heavily to the OS (like Pg). Bioperl-db 
caches a lot; specifically, it caches species, ontology terms, and 
dbxrefs. With a database like swissprot that is very diverse in terms 
of species (you've got several thousand in there) and richly annotated 
with dbxrefs, the loading script alone ends up consuming *a lot* of 
memory. Last time I uploaded swissprot (as Uniprot) the loading script 
process consumed about 700MB. The db server was running 
on another machine ...
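
To illustrate the trade-off, here is a minimal sketch in Python (not 
Perl, and not bioperl-db's actual code; the class and function names 
are made up for illustration). It shows the general idea of a lookup 
cache: each distinct key costs one expensive query and then stays in 
memory for the rest of the run, so the cache grows with the number of 
distinct species, terms, and dbxrefs seen.

```python
# Hypothetical sketch of a lookup cache; names are illustrative only
# and do not correspond to the bioperl-db API.
class CachingLookup:
    def __init__(self, fetch):
        self._fetch = fetch   # expensive lookup, e.g. a SELECT by name
        self._cache = {}      # grows by one entry per distinct key
        self.misses = 0       # number of times the expensive path ran

    def get(self, key):
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._fetch(key)
        return self._cache[key]


def fake_species_query(name):
    # Stand-in for a database round trip.
    return {"name": name}


species = CachingLookup(fake_species_query)
for name in ["Homo sapiens", "Mus musculus", "Homo sapiens"]:
    species.get(name)

# Two distinct species were seen, so two queries ran and two entries
# are now held in memory; the repeated lookup was served from the cache.
print(species.misses, len(species._cache))
```

The first record for each species pays for the query; later records do 
not, at the price of keeping every entry in the process's memory.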

I typically see 4-10 seqs/second on a 1.8GHz CPU (1GB memory) against 
an Oracle database running on an even faster machine.
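
To put these rates next to the one sequence every 5 to 10 seconds you 
reported, here is a back-of-the-envelope sketch in Python. The entry 
count N is an assumed round number for illustration only, not a figure 
from this thread; substitute the real count for your release.

```python
def load_hours(n_entries, seqs_per_second):
    """Hours needed to load n_entries at a constant rate."""
    return n_entries / seqs_per_second / 3600.0


N = 150_000  # ASSUMPTION: illustrative entry count, not from this thread

rates = [
    ("4 seqs/s", 4.0),
    ("10 seqs/s", 10.0),
    ("1 seq per 5 s", 1.0 / 5.0),
    ("1 seq per 10 s", 1.0 / 10.0),
]
for label, rate in rates:
    print(f"{label:>15}: {load_hours(N, rate):8.1f} h")
```

At a constant rate the load time scales linearly, so going from 1 
sequence per 5 seconds to 4 sequences per second cuts it twentyfold.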

So my conclusion as to what you most likely saw is bioperl-db being 
slow at the start because nothing is cached yet (species lookups are 
expensive if not cached), and then not getting any better because the 
slowness was compounded more and more by memory as well as disk I/O 
contention.

If you really want this to fly, either get a really fast CPU, or better 
yet, two CPUs, and at least two disks. If you want the loader to run on 
the same machine as the db process, then get 3 disks if you can 
(sequence source file on one, db transaction log on the second, db data 
files on the third). And get no less than 1GB of RAM; if you want db 
and loader on the same box get at least 2GB.

	-hilmar

On Tuesday, May 11, 2004, at 10:27  AM, Henry R Bigelow wrote:

> Hi,
> 	my name is Henry Bigelow and I recently installed bioperl-1.4,
> bioperl-db, dbi and dbd-mysql, mysql-4.0 (with InnoDB enabled),
> biosql-schema, and instantiated biosqldb-mysql.sql.  i've successfully
> loaded some sequences of release43.dat, the swissprot flat file, but the
> throughput is roughly 1 sequence every 5 to 10 seconds, on a (admittedly
> slow) 400 MHz 2 CPU Pentium III with 256 MB memory.  I ran the command:
>
> perl load_seqdatabase.pl --host localhost --dbname bioseqdb --namespace
> swissprot --dbuser bigelow --dbpass XXX --driver mysql --format swiss
> /data/swissprot/release43.dat
>
>
> I also ran it (on a set of 15 swissprot entries) with a profiler:
>
> perl -d:DProf load_seqdatabase.pl ...
> then with
> dprofpp -u
> i got this:
>
> %Time ExclSec CumulS #Calls sec/call Csec/c  Name
>  9.62   0.800  0.985  15282   0.0001 0.0001  Bio::DB::Persistent::PersistentObject::isa
>  9.54   0.793  1.403  11909   0.0001 0.0001  Bio::DB::Persistent::PersistentObject::AUTOLOAD
>  9.25   0.769  3.152   8888   0.0001 0.0004  Bio::DB::BioSQL::BasePersistenceAdaptor::_create_persistent
>  4.69   0.390  2.922   7733   0.0001 0.0004  Bio::DB::BioSQL::BasePersistentAdaptor::_process_child
>  4.59   0.382  0.382  26865   0.0000 0.0000  Bio::DB::Persistent::PersistentObject::obj
>  3.84   0.319  0.319  32822   0.0000 0.0000  UNIVERSAL::isa
>  3.69   0.307  0.372     86   0.0036 0.0043  Bio::DB::BioSQL::ReferenceAdaptor::_crc64
>  3.28   0.273  1.195    258   0.0011 0.0046  Bio::Root::Root::_load_module
>  2.80   0.233  3.545   5465   0.0000 0.0006  Bio::DB::BioSQL::BasePersistenceAdaptor::create_persistent
>  2.74   0.228  0.228    291   0.0008 0.0008  Bio::Root::RootI::stack_trace
>  1.92   0.160  0.160   1794   0.0001 0.0001  DBI::st::execute
>  1.84   0.153  0.534   1608   0.0001 0.0003  Bio::DB::Persistent::PersistentObject::new
>  1.80   0.150  0.150   7215   0.0000 0.0000  Bio::DB::Persistent::PersistentObject::primary_key
>  1.74   0.145  0.185   2640   0.0001 0.0001  Bio::Root::Root::new
>  1.71   0.142  1.078    474   0.0003 0.0023  Bio::DB::BioSQL::BaseDriver::insert_object
>
> i do realize that these perl objects are large, but it still seems quite
> slow.  (i'm not even sure whether the profiler demonstrates that the
> majority of time is spent instantiating perl objects as opposed to running
> mysql commands.)
>
> all bioperl-db, bioperl, dbi and dbd-mysql tests came out ok (the vast
> majority of them anyway).
>
> incidentally, it took me a week of getting errors during
> load_seqdatabase.pl loading, before i discovered the true cause:  that
> a perl executable with threading enabled does NOT work with this.  (The
> author of dbd-mysql or dbi warns about this, but i didn't heed the warning
> at first).
>
>
> if anyone has any ideas about what might be making it slow, please let me
> know!  i'd greatly appreciate it.
>
> Sincerely,
>
> Henry Bigelow
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------



