[BioSQL-l] Timing importing GenBank files into BioSQL

Mon Aug 18 12:23:38 EDT 2008

Hi,

I've started comparing how well BioPerl and Biopython agree when
loading GenBank files into BioSQL.  I've been using the BioPerl
load_seqdatabase.pl script to import sample GenBank files, but I was a
little surprised by how long it takes to run on E. coli K12,
NC_000913.gbk (about 10 minutes!).  My example input files are
E. coli K12, NC_000913.2, from
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K12_substr__MG1655/NC_000913.gbk
and Nanoarchaeum equitans, NC_005213.1, from
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans/NC_005213.gbk

Example timing using BioPerl, after emptying most (all?) of my MySQL
test database:

$ mysql --user="gbrowse" --pass="biosql" test_biosql -e "truncate
table bioentry; truncate table seqfeature; truncate table
bioentry_dbxref; truncate table term; truncate table ontology;
truncate table reference; truncate table dbxref;"

$ time perl ~/Downloads/Software/bioperl-db-1.5.2_100/scripts/biosql/load_seqdatabase.pl \
    --dbname test_biosql --namespace test --format genbank --dbpass biosql \
    --dbuser gbrowse Nanoarchaeum_equitans/NC_005213.gbk
Loading Nanoarchaeum_equitans/NC_005213.gbk ...

real	0m17.116s
user	0m13.914s
sys	0m2.293s

$ time perl ~/Downloads/Software/bioperl-db-1.5.2_100/scripts/biosql/load_seqdatabase.pl \
    --dbname test_biosql --namespace test --format genbank --dbpass biosql \
    --dbuser gbrowse Escherichia_coli_K12_substr__MG1655/NC_000913.gbk
Loading Escherichia_coli_K12_substr__MG1655/NC_000913.gbk ...

real	10m0.784s
user	6m23.898s
sys	3m26.189s

This seems an unreasonably long time (and I've repeated the run more
than three times).  Is this normal?  I know this may not be a fair
comparison, but this is what Biopython takes (code at end of email):

$ mysql --user="gbrowse" --pass="biosql" test_biosql -e "truncate
table bioentry; truncate table seqfeature; truncate table
bioentry_dbxref; truncate table term; truncate table ontology;
truncate table reference; truncate table dbxref;"

$ time python load.py
Importing Nanoarchaeum_equitans/NC_005213.gbk
Loaded 1 records
Took 5.32s including the commit
Importing Escherichia_coli_K12_substr__MG1655/NC_000913.gbk
Loaded 1 records
Took 64.15s including the commit

real	1m10.037s
user	0m31.942s
sys	0m6.913s
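
For a rough side-by-side, the figures above can be converted to seconds
and divided.  This is only a sketch: the to_seconds helper is a made-up
name for illustration, and the comparison is not exact, because the
BioPerl numbers are the whole-process "real" times while the Biopython
numbers are the per-file times printed inside load.py.

```python
import re

def to_seconds(stamp):
    """Convert a `time`-style stamp such as '10m0.784s' into seconds."""
    match = re.match(r"^(\d+)m([\d.]+)s$", stamp)
    if match is None:
        raise ValueError("unrecognised time stamp: %r" % stamp)
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

# Wall-clock ("real") times for the two BioPerl runs, copied from above.
bioperl = {
    "NC_005213": to_seconds("0m17.116s"),
    "NC_000913": to_seconds("10m0.784s"),
}

# Per-file times printed by load.py.  These are measured inside the
# script, so they exclude interpreter start-up and the connection.
biopython = {"NC_005213": 5.32, "NC_000913": 64.15}

ratios = {}
for name in sorted(bioperl):
    ratios[name] = bioperl[name] / biopython[name]
    print("%s: BioPerl %.1fs, Biopython %.1fs, ratio %.1fx"
          % (name, bioperl[name], biopython[name], ratios[name]))
```

On these numbers BioPerl is about 3x slower for the small genome and
about 9x slower for E. coli, so the gap grows with the size of the
record.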

I'm wondering if the BioPerl time is typical (I hope not), and if
there are any computationally intensive or otherwise slow things it
does that Biopython might be skipping (checksums? fetching taxonomy?).

Thanks

Peter

---------------------------------------------------------------------
The contents of my load.py script:

import time
from Bio import SeqIO
from BioSQL import BioSeqDatabase

# Connect to the local MySQL BioSQL test database
server = BioSeqDatabase.open_database(driver="MySQLdb", user="gbrowse",
                passwd="biosql", host="localhost", db="test_biosql")

# Select the "test" namespace (sub-database)
db = server["test"]

filenames = ["Nanoarchaeum_equitans/NC_005213.gbk",
             "Escherichia_coli_K12_substr__MG1655/NC_000913.gbk"]

for filename in filenames:
    # Time the parse, load and commit together for each file
    start = time.time()
    print("Importing %s" % filename)
    handle = open(filename)
    records = SeqIO.parse(handle, "genbank")
    print("Loaded %i records" % db.load(records))
    handle.close()
    server.adaptor.commit()
    print("Took %0.2fs including the commit" % (time.time() - start))