[BioSQL-l] Timing importing GenBank files into BioSQL
Peter
biopython at maubp.freeserve.co.uk
Mon Aug 18 16:23:38 UTC 2008
Hi,
I've started looking at how well BioPerl and Biopython agree when
loading GenBank files into BioSQL. I've been using the BioPerl
load_seqdatabase.pl script to import sample GenBank files, but I was
a little surprised by how long it takes to run for E. coli K12,
NC_000913.gbk (about 10 minutes!). My example input files are
E. coli K12, NC_000913.2, from
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K12_substr__MG1655/NC_000913.gbk
and Nanoarchaeum equitans, NC_005213.1, from
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans/NC_005213.gbk
Example timing using BioPerl, after emptying most (all?) of my MySQL
test database:
$ mysql --user="gbrowse" --pass="biosql" test_biosql -e "truncate
table bioentry; truncate table seqfeature; truncate table
bioentry_dbxref; truncate table term; truncate table ontology;
truncate table reference; truncate table dbxref;"
$ time perl ~/Downloads/Software/bioperl-db-1.5.2_100/scripts/biosql/load_seqdatabase.pl \
    --dbname test_biosql --namespace test --format genbank --dbpass biosql \
    --dbuser gbrowse Nanoarchaeum_equitans/NC_005213.gbk
Loading Nanoarchaeum_equitans/NC_005213.gbk ...
real 0m17.116s
user 0m13.914s
sys 0m2.293s
$ time perl ~/Downloads/Software/bioperl-db-1.5.2_100/scripts/biosql/load_seqdatabase.pl \
    --dbname test_biosql --namespace test --format genbank --dbpass biosql \
    --dbuser gbrowse Escherichia_coli_K12_substr__MG1655/NC_000913.gbk
Loading Escherichia_coli_K12_substr__MG1655/NC_000913.gbk ...
real 10m0.784s
user 6m23.898s
sys 3m26.189s
This does seem a rather unreasonable length of time (and I've repeated
this more than three times). Is this normal? I know this may not be a
fair comparison, but this is what Biopython takes (code at end of
email):
$ mysql --user="gbrowse" --pass="biosql" test_biosql -e "truncate
table bioentry; truncate table seqfeature; truncate table
bioentry_dbxref; truncate table term; truncate table ontology;
truncate table reference; truncate table dbxref;"
$ time python load.py
Importing Nanoarchaeum_equitans/NC_005213.gbk
Loaded 1 records
Took 5.32s including the commit
Importing Escherichia_coli_K12_substr__MG1655/NC_000913.gbk
Loaded 1 records
Took 64.15s including the commit
real 1m10.037s
user 0m31.942s
sys 0m6.913s
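For a rough side-by-side of these figures, here is a minimal sketch
that only re-uses the numbers printed above. The helper names
to_seconds and cpu_fraction are made up for illustration. Bear in mind
that the user and sys figures from time cover the client process only,
not the MySQL server, so the CPU share is just a hint about where the
wall-clock time goes.

```python
import re


def to_seconds(stamp):
    """Convert a `time` field such as '10m0.784s' into seconds."""
    match = re.match(r"^(\d+)m([\d.]+)s$", stamp)
    if match is None:
        raise ValueError("unrecognised time field: %r" % stamp)
    return int(match.group(1)) * 60 + float(match.group(2))


def cpu_fraction(real, user, sys_):
    """Share of wall-clock time the client process spent on CPU."""
    return (to_seconds(user) + to_seconds(sys_)) / to_seconds(real)


# (real, user, sys) copied from the transcripts above.
runs = {
    "BioPerl, NC_000913 only": ("10m0.784s", "6m23.898s", "3m26.189s"),
    "Biopython, both files": ("1m10.037s", "0m31.942s", "0m6.913s"),
}

for label, (real, user, sys_) in sorted(runs.items()):
    print("%s: %.1fs wall, %.0f%% of it client CPU"
          % (label, to_seconds(real), 100 * cpu_fraction(real, user, sys_)))

# Wall-clock ratio for the E. coli file alone: BioPerl's `real` time
# against the 64.15s that load.py reported for the same file.
ecoli_ratio = to_seconds("10m0.784s") / 64.15
print("E. coli load ratio: %.1fx" % ecoli_ratio)
```

On these numbers the E. coli load is roughly nine times slower with
BioPerl, and the Perl client is on CPU for nearly all of its
wall-clock time. That would suggest the time is going in the client
itself rather than in waiting for MySQL, but it is only a hint.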
I'm wondering if the BioPerl time is typical (I hope not), and if
there are any computationally intensive or otherwise slow things it
does that Biopython might be skipping (checksums? fetching taxonomy?).
Thanks
Peter
---------------------------------------------------------------------
The contents of my load.py script:
import time
from Bio import SeqIO
from BioSQL import BioSeqDatabase

# Connect to the local MySQL BioSQL database and select the "test"
# sub-database.
server = BioSeqDatabase.open_database(driver="MySQLdb", user="gbrowse",
                                      passwd="biosql", host="localhost",
                                      db="test_biosql")
db = server["test"]

# Time each file separately; the reported time includes the commit.
for filename in ["Nanoarchaeum_equitans/NC_005213.gbk",
                 "Escherichia_coli_K12_substr__MG1655/NC_000913.gbk"]:
    start = time.time()
    print "Importing %s" % filename
    records = SeqIO.parse(open(filename), "genbank")
    print "Loaded %i records" % db.load(records)
    server.adaptor.commit()
    print "Took %0.2fs including the commit" % (time.time()-start)
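
To see how much of each figure is the commit rather than the load
itself, the timing could be split into two phases. This is only a
sketch: the helper name timed and the stand-in functions fake_load and
fake_commit are hypothetical, and exist so the snippet runs without a
database. In the script above, the two callables would be something
like lambda: db.load(SeqIO.parse(open(filename), "genbank")) and
server.adaptor.commit.

```python
import time


def timed(label, load, commit):
    """Run load() then commit(), timing the two phases separately."""
    start = time.time()
    count = load()
    loaded = time.time()
    commit()
    done = time.time()
    print("%s: loaded %i records in %.2fs, commit took %.2fs"
          % (label, count, loaded - start, done - loaded))
    return count, loaded - start, done - loaded


# Stand-ins for db.load(records) and server.adaptor.commit().
def fake_load():
    return 1


def fake_commit():
    pass


count, load_secs, commit_secs = timed("demo", fake_load, fake_commit)
```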