[BioPython] Entrez.efetch large files
Stephan Schiffels
stephan.schiffels at uni-koeln.de
Thu Oct 9 09:01:11 EDT 2008
Hi Peter,
On 08.10.2008 at 22:57, Peter wrote:
> I'm curious - do you have any numbers for the relative times to load a
> SeqRecord from a pickle, or re-parse it from the GenBank file? I'm
> aware of some "hot spots" in the GenBank parser which take more time
> than they really need to (feature location parsing in particular).
So, here is a little profiling of reading a large chromosome both from
a GenBank file and from a pickled SeqRecord (both from disk, of course):
>>> from timeit import Timer
>>> t = Timer("a = cPickle.load(open('DroMel_chr2L.pickle'))",
...           "import cPickle")
>>> t.timeit(number=1)
5.2086620330810547
>>> t = Timer("a = SeqIO.read(open('DroMel_chr2L.gbk'), 'genbank')",
...           "from Bio import SeqIO")
>>> t.timeit(number=1)
53.902437925338745
As you can see, there is a roughly 10-fold speed gain (about 5.2 s
versus 53.9 s) using cPickle compared with SeqIO.read() ... not bad!
The pickled file is a bit larger than the GenBank file, but not much.
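
For anyone repeating this measurement on a current Python, where cPickle
has been merged into the standard pickle module, the comparison could be
sketched roughly as follows. The helper names (time_once, load_pickle,
parse_genbank) are made up for illustration, and parse_genbank assumes a
Biopython version whose SeqIO.read() accepts a filename:

```python
import pickle
import time


def time_once(func):
    """Call func() once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


def load_pickle(path):
    """Load a pickled object from disk (pickles must be opened in binary mode)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def parse_genbank(path):
    """Parse a single-record GenBank file; assumes Biopython is installed."""
    from Bio import SeqIO  # imported lazily so the other helpers work without it
    return SeqIO.read(path, "genbank")


# Hypothetical usage with the file names from the session above:
#   record, secs = time_once(lambda: load_pickle("DroMel_chr2L.pickle"))
#   record, secs = time_once(lambda: parse_genbank("DroMel_chr2L.gbk"))
```

A single run (number=1 above) is sensitive to disk caching, so the
timings are best read as a rough ratio rather than exact figures.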
> However, even if using pickles is much faster, I would personally
> still rather use this approach:
>
> if file not present:
> download from NCBI and save it
> parse file
>
That's precisely how I do it now. It works well!
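
The download-once pattern quoted above could be sketched roughly like
this. It is only an illustration: ensure_local_copy and fetch_genbank are
made-up helper names, the email address is a placeholder, and the
efetch arguments assume a current Biopython and NCBI E-utilities setup.
The whole download is written to disk before any parsing starts, so a
slow parser never holds the network connection open:

```python
import os


def ensure_local_copy(filename, download):
    """Return filename, calling download() only if the file is not on disk.

    download is any zero-argument callable that returns the file
    contents as text.
    """
    if not os.path.isfile(filename):
        data = download()  # fetch everything first, parse later
        with open(filename, "w") as handle:
            handle.write(data)
    return filename


def fetch_genbank(accession):
    """Download one GenBank record as text; assumes Biopython is installed."""
    from Bio import Entrez  # imported lazily so ensure_local_copy works without it
    Entrez.email = "you@example.org"  # placeholder; NCBI asks for a real address
    handle = Entrez.efetch(db="nucleotide", id=accession,
                           rettype="gb", retmode="text")
    try:
        return handle.read()
    finally:
        handle.close()


# Hypothetical usage (the accession is left to the reader):
#   path = ensure_local_copy("DroMel_chr2L.gbk",
#                            lambda: fetch_genbank(accession))
#   record = SeqIO.read(path, "genbank")
```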
> I think it is safer to keep the original data in the NCBI provided
> format, rather than as a python pickle. Some of my reasons include:
>
> * you might want to parse the files with a different tool one day
> (e.g. grep, or maybe BioPerl, or EMBOSS)
> * different versions of Biopython will parse the file slightly
> differently (e.g. once Bugs 2225 and 2578 are fixed the SeqRecord
> should include slightly more information from a GenBank file) while
> your pickle will be static
> * if the SeqRecord or Seq objects themselves change slightly between
> versions of Biopython, the pickle may not work
> * more generally, is it safe to transfer the pickle files between
> different computers (e.g. different versions of python or Biopython,
> different OS, different line endings)?
>
> These issues may not be a problem in your setting.
You are right, and in fact I now save both the GenBank file and the
pickled file to disk, so I always have a backup.
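
Keeping both files suggests treating the pickle purely as a disposable
cache of the GenBank file. A minimal sketch of that idea, with a made-up
helper name (load_with_pickle_cache) and an assumed ".pickle" suffix for
the cache file: the pickle is used only if it is at least as new as the
source file and can actually be loaded, otherwise the source is parsed
again and the cache rewritten.

```python
import os
import pickle


def load_with_pickle_cache(source_path, parse, cache_path=None):
    """Return parse(source_path), reusing a pickle of the result when possible.

    parse is any callable taking a path and returning a picklable object.
    """
    if cache_path is None:
        cache_path = source_path + ".pickle"  # assumed naming convention

    cache_is_fresh = (
        os.path.isfile(cache_path)
        and os.path.getmtime(cache_path) >= os.path.getmtime(source_path)
    )
    if cache_is_fresh:
        try:
            with open(cache_path, "rb") as handle:
                return pickle.load(handle)
        except Exception:
            # Truncated or incompatible pickle (for example after a
            # library upgrade): fall through and re-parse the original.
            pass

    obj = parse(source_path)
    with open(cache_path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return obj


# Hypothetical usage, assuming Biopython is installed:
#   from Bio import SeqIO
#   record = load_with_pickle_cache("DroMel_chr2L.gbk",
#                                   lambda p: SeqIO.read(p, "genbank"))
```

This does not detect a pickle that loads successfully but was written by
an older parser version; deleting the cache file after an upgrade covers
that case.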
>
> More generally, you could consider using BioSQL, but this may be
> overkill for your needs.
>
BioSQL is something that I like a lot. I have not yet dug my way
through it, but hopefully there will be options for me from that side
as well.
>> However, as you pointed out, parsing from the internet causes
>> problems.
>
> If you do work out exactly what is going wrong, I would be interested
> to hear about it.
>
Hmm, I probably won't find out. Parsing from the internet works for
small files, so it must be some network issue; I don't know. Since I am
on the university network, I doubt that the error starts on my side.
Maybe NCBI closes the connection if the other side is too slow, which is
the case during parsing... But I understand too little about
networking.
>> I think the advantages of not having to download each time were
>> clear to me from the tutorial. It just was not apparent to me that
>> downloading AND parsing at the same time causes problems. The
>> additions to the tutorial seem to give some idea.
>
> Your approach all makes sense. Thanks for explaining your thoughts. I
> don't think I'd ever tried efetch on such a large GenBank file in the
> first place - for genomes I have usually used FTP instead.
>
> Peter
Regards,
Stephan