[BioPython] Entrez.efetch large files

Peter biopython at maubp.freeserve.co.uk
Thu Oct 9 14:18:52 UTC 2008


Peter wrote:
>> I'm curious - do you have any numbers for the relative times to load a
>> SeqRecord from a pickle, or re-parse it from the GenBank file?  I'm
>> aware of some "hot spots" in the GenBank parser which take more time
>> than they really need to (feature location parsing in particular).

Stephan wrote:
> So, here is a little profiling of reading a large chromosome both as genbank
> and from a pickled SeqRecord (both from disk of course):
> >>> t = Timer("a = cPickle.load(open('DroMel_chr2L.pickle'))", "import cPickle")
> >>> t.timeit(number=1)
> 5.2086620330810547
> >>> t = Timer("a = SeqIO.read(open('DroMel_chr2L.gbk'), 'genbank')", "from Bio import SeqIO")
> >>> t.timeit(number=1)
> 53.902437925338745
>
> As you see, there is an amazing 10-fold speed gain using cPickle compared
> with SeqIO.read() ... not bad! The pickled file is a bit larger than the
> GenBank file, but not much.

I'm seeing more like a three-fold speed gain (using cPickle protocol
0, with Python 2.5.2 on a Mac), which is less impressive.  For a
10-fold speed-up I can see why the complexity overhead of using pickle
could be worthwhile.

cPickle.load() took 8.5s
cPickle.load() took 10.0s
cPickle.load() took 9.9s
SeqIO.read() took 29.9s
SeqIO.read() took 29.8s
SeqIO.read() took 29.8s

(Script below)

I'm not very impressed with the 30 seconds needed to parse a 30MB
file.  There is certainly scope for speeding up the GenBank parsing
here.
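
One way to look for such hot spots is to run the parse under cProfile
and sort the report by cumulative time. What follows is only a rough
sketch, not something used for the timings in this thread: profile_call
is a hypothetical helper name, and slow_parse / parse_location are
stand-ins for the real SeqIO.read() call so that the example runs
without Biopython or the data file. Unlike the script below, it is
written for Python 3.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, report_text).

    Hypothetical helper for illustration; not part of Biopython.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show the ten entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


def parse_location(text):
    # Stand-in for per-feature location parsing ("start..end").
    start, end = text.split("..")
    return int(start), int(end)


def slow_parse(lines):
    # Stand-in for the parser: one location parsed per input line.
    return [parse_location(line) for line in lines]


if __name__ == "__main__":
    fake_locations = ["%d..%d" % (i, i + 100) for i in range(50000)]
    result, report = profile_call(slow_parse, fake_locations)
    print(report)
```

To profile the real parser, the same helper could presumably be called
as profile_call(SeqIO.read, open("NC_004354.gbk"), "genbank").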

Peter

---------------

My timing script:

# Python 2 script (print statements, cPickle).
import os
import cPickle
import time
from Bio import Entrez, SeqIO
#Entrez.email = "..."

genome_id = "57"
genbank_filename = "NC_004354.gbk"
pickle_filename = "NC_004354.pickle"

# Download the GenBank file once; later runs reuse the local copy.
if not os.path.isfile(genbank_filename):
    print "Downloading..."
    net_handle = Entrez.efetch(db="genome", id=genome_id, rettype="genbank")
    out_handle = open(genbank_filename, "w")
    out_handle.write(net_handle.read())
    out_handle.close()
    print "Saved"

# Parse the GenBank file once and pickle the resulting SeqRecord.
if not os.path.isfile(pickle_filename):
    print "Parsing..."
    record = SeqIO.read(open(genbank_filename), 'genbank')
    print "Pickling..."
    # No protocol argument, so cPickle uses its default (protocol 0).
    out_handle = open(pickle_filename, "wb")
    cPickle.dump(record, out_handle)
    out_handle.close()
    print "Saved"

# Time three loads of each kind; opening the file is included in each timing.
print "Profiling..."
for i in range(3):
    start = time.time()
    record = cPickle.load(open(pickle_filename, "rb"))
    print "cPickle.load() took %0.1fs" % (time.time() - start)
for i in range(3):
    start = time.time()
    record = SeqIO.read(open(genbank_filename), 'genbank')
    print "SeqIO.read() took %0.1fs" % (time.time() - start)
print "Done"
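
A possible variation on the script above, which was not measured in
this thread: since the timings used pickle protocol 0, it may be worth
comparing it against a binary protocol. The sketch below uses only the
standard library on Python 3 (where cPickle is simply pickle).
time_pickle_protocols is a hypothetical helper name, and fake_record is
a stand-in for a real SeqRecord, so the numbers it prints say nothing
about GenBank data.

```python
import pickle
import time


def time_pickle_protocols(obj, protocols=(0, pickle.HIGHEST_PROTOCOL)):
    """Return {protocol: (size_in_bytes, load_seconds)} for obj.

    Hypothetical helper for illustration only.
    """
    results = {}
    for protocol in protocols:
        data = pickle.dumps(obj, protocol=protocol)
        start = time.perf_counter()
        restored = pickle.loads(data)
        elapsed = time.perf_counter() - start
        # Sanity check: the round trip must give back an equal object.
        assert restored == obj
        results[protocol] = (len(data), elapsed)
    return results


if __name__ == "__main__":
    # Stand-in data: a long sequence string plus many small tuples.
    fake_record = {
        "seq": "ACGT" * 250000,
        "features": [(i, i + 100) for i in range(100000)],
    }
    timings = time_pickle_protocols(fake_record)
    for protocol, (size, seconds) in sorted(timings.items()):
        print("protocol %d: %d bytes, load took %0.3fs"
              % (protocol, size, seconds))
```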


