[BioPython] Entrez.efetch large files
Stephan Schiffels
stephan80 at mac.com
Wed Oct 8 20:37:17 UTC 2008
Hi Peter,
OK, first of all... you were right, of course: with
out_handle.write(net_handle.read()) the download works properly, and
reading the file from disk also works. The tutorial is very clear on
that point, I agree.
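For very large records, the same save-to-disk step can also be done in chunks, so the whole file never has to sit in memory at once. This is only a minimal sketch (Python 3), not the tutorial's code: the helper name save_handle and the chunk size are made up for illustration, and it assumes the handle passed in yields text, as the one from Entrez.efetch is expected to for GenBank output.

```python
import shutil


def save_handle(net_handle, filename, chunk_size=64 * 1024):
    """Copy a readable text handle to a local file, chunk by chunk.

    `net_handle` is assumed to be any file-like object whose read(n)
    returns str, e.g. the handle returned by Entrez.efetch (assumption).
    """
    with open(filename, "w") as out_handle:
        # copyfileobj reads at most chunk_size characters at a time,
        # unlike out_handle.write(net_handle.read()), which reads it all.
        shutil.copyfileobj(net_handle, out_handle, chunk_size)
```

The caller is still responsible for closing net_handle afterwards, just as in the read()/write() version.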
To illustrate why I made the mistake even though I had read the
tutorial, my code looked something like this:
try:
    unpickle a file into a SeqRecord...
except IOError:
    download the file into a SeqRecord AND pickle it to disk afterwards
So, as you can see, I was already trying to download only once!
I realized it was smarter to do the disk-saving step via cPickle, since
reading the pickle back is also faster than parsing the GenBank file
each time. So my goal was either to load a pickled SeqRecord, or to
download into a SeqRecord and then pickle it to disk. I hope you agree
that, as far as NCBI's resources are concerned, this approach is (at
least in principle) already close to optimal. However, as you pointed
out, parsing directly from the internet causes problems.
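The load-or-build idea above can be sketched roughly as follows. This is a hedged illustration in Python 3 (where cPickle is simply pickle), not the code I actually ran: load_or_build and its build argument are hypothetical names, and build stands for whatever produces the SeqRecord, ideally "download to disk, then parse the local file" as Peter recommends.

```python
import pickle


def load_or_build(cache_path, build):
    """Return the pickled object at cache_path, or build and cache it.

    `build` is a hypothetical zero-argument callable that produces the
    object (for example, one that saves a GenBank file to disk and then
    parses it into a SeqRecord).
    """
    try:
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    except IOError:
        # No readable cache yet (IOError is an alias of OSError in
        # Python 3), so do the expensive step once and store the result.
        obj = build()
        with open(cache_path, "wb") as handle:
            pickle.dump(obj, handle, pickle.HIGHEST_PROTOCOL)
        return obj
```

On the first call build runs and the result is pickled; on later calls only the pickle is read, so NCBI is not contacted again.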
The advantages of not having to download each time were clear to me
from the tutorial. What was not apparent to me was that downloading
AND parsing at the same time causes problems. The additions to the
tutorial seem to make that clearer.
Thanks and Regards,
Stephan
On 08.10.2008 at 21:32, Peter wrote:
>> Yes - one big hint: DON'T try and parse these large files directly
>> from the internet. Use efetch to download the file and save it to
>> disk. Then open this local file for parsing.
>> ...
>> Do you think the Biopython tutorial should be more explicit about
>> this topic?
>
> I've changed the tutorial (the SeqIO and Entrez chapters) in CVS to
> make this advice more explicit, and included an example of doing this
> too.
>
> import os
> from Bio import SeqIO
> from Bio import Entrez
> Entrez.email = "A.N.Other@example.com" # Always tell NCBI who you are
> filename = "gi_186972394.gbk"
> if not os.path.isfile(filename) :
>     print "Downloading..."
>     net_handle = Entrez.efetch(db="nucleotide",id="186972394",rettype="genbank")
>     out_handle = open(filename, "w")
>     out_handle.write(net_handle.read())
>     out_handle.close()
>     net_handle.close()
>     print "Saved"
>
> print "Parsing..."
> record = SeqIO.read(open(filename), "genbank")
> print record
>
>
> Peter