[BioPython] Entrez.efetch large files

Stephan Schiffels stephan80 at mac.com
Wed Oct 8 20:37:17 UTC 2008


  Hi Peter,

OK, first of all... you were right of course: with
out_handle.write(net_handle.read()) the download works properly, and
reading the file from disk also works. The tutorial is very clear on
that point, I agree.

To illustrate why I made the mistake even though I had read the
tutorial, my code looked roughly like this:

try:
    unpickle a file into a SeqRecord...
except IOError:
    download the file into a SeqRecord AND pickle it to disk afterwards

So, as you can see, I was already trying to download only once!
I realized that saving to disk was smarter to do via cPickle, since
loading the pickle is also faster than parsing the GenBank file
each time. So my goal was either to load a pickled SeqRecord, or to
download into a SeqRecord and then pickle it to disk. I hope you agree
that, as far as NCBI's resources are concerned, this approach is (at
least in principle) already quite economical. However, as you pointed
out, parsing directly from the internet causes problems.
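
For illustration, here is a minimal sketch of that load-or-download
logic. It is not my actual code: load_or_build and fake_build are
hypothetical names, and fake_build only stands in for the real step,
which would save the efetch result to disk and then parse that local
file with SeqIO.read. The sketch assumes a current Python, so it uses
pickle (cPickle only exists in Python 2).

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build):
    """Return the object pickled at cache_path; build and cache it if absent.

    `build` is a hypothetical zero-argument callable standing in for the
    "download to disk, then parse the local file" step.
    """
    try:
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    except IOError:  # no readable cache file yet
        obj = build()
        with open(cache_path, "wb") as handle:
            pickle.dump(obj, handle, pickle.HIGHEST_PROTOCOL)
        return obj


calls = []  # records how often the expensive step actually runs


def fake_build():
    # Stand-in for the real download-and-parse step (assumption, no network).
    calls.append(1)
    return {"id": "186972394", "seq": "ACGT"}


cache_path = os.path.join(tempfile.mkdtemp(), "record.pickle")
first = load_or_build(cache_path, fake_build)   # builds, then pickles
second = load_or_build(cache_path, fake_build)  # loads from the pickle
```

The second call never touches fake_build, which is the whole point:
the expensive step runs once and later runs only read the pickle.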

I think the advantages of not having to download each time were clear
to me from the tutorial. It just didn't occur to me that downloading
AND parsing at the same time causes problems. The additions to the
tutorial seem to get that across.

Thanks and Regards,
Stephan

On 08.10.2008 at 21:32, Peter wrote:

>> Yes - one big hint: DON'T try and parse these large files directly
>> from the internet.  Use efetch to download the file and save it to
>> disk.  Then open this local file for parsing.
>> ...
>> Do you think the Biopython tutorial should be more explicit about
>> this topic?
>
> I've changed the tutorial (the SeqIO and Entrez chapters) in CVS to
> make this advice more explicit, and included an example of doing this
> too.
>
> import os
> from Bio import SeqIO
> from Bio import Entrez
> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
> filename = "gi_186972394.gbk"
> if not os.path.isfile(filename):
>     # Download only if there is no local copy yet
>     print("Downloading...")
>     net_handle = Entrez.efetch(db="nucleotide", id="186972394",
>                                rettype="genbank")
>     out_handle = open(filename, "w")
>     out_handle.write(net_handle.read())
>     out_handle.close()
>     net_handle.close()
>     print("Saved")
>
> print("Parsing...")
> # Parse from the local file, not from the network handle
> record = SeqIO.read(open(filename), "genbank")
> print(record)
>
>
> Peter


