[BioPython] Entrez.efetch large files

Peter biopython at maubp.freeserve.co.uk
Wed Oct 8 18:57:08 UTC 2008


On Wed, Oct 8, 2008 at 6:11 PM, Stephan <stephan80 at mac.com> wrote:
> Sorry to have an Entrez.efetch-issue again, but somehow there
> seems to be a problem with very large files.
> ...
> If I change the id to "56" (chromosome 4, which is shorter) it works.
> But for all the other chromosomes (ids: 57 - 61) it fails.
> If I download the genbank files manually from the ftp-server and
> then use SeqIO.read() it works, so the download-process corrupts
> the genbank files if they are very large (about 35 MB) I guess...
>
> Any hints?

Yes - one big hint: DON'T try to parse these large files directly
from the internet.  Use efetch to download the file and save it to
disk, then open the local file for parsing.

There are several good reasons for this:

(1) Rerunning the script (e.g. during development) won't have to
re-download the file.  Repeated downloads waste time and money (yours
and, more importantly, the NCBI's).  You may be fine, but the NCBI can
and does ban IP addresses that breach its usage guidelines.

(2) If the parsing fails, you have something easy to debug: the local
file, which you can open in a text editor to check.
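
The download-then-parse approach described above might look something
like the sketch below.  It is only an illustration: download_once and
fetch_and_parse are hypothetical helper names, and the db value
"nucleotide" and the ".part" temporary file are assumptions, not taken
from the original script.  It also assumes a Biopython recent enough
that Entrez.efetch returns a text-mode handle for retmode="text" and
SeqIO.read accepts a filename.

```python
import os
import shutil


def download_once(filename, open_handle):
    """Save the data from open_handle() to filename, unless it exists.

    open_handle is any zero-argument callable returning a text-mode,
    file-like object (for example a wrapper around Entrez.efetch).
    Returns True if a download happened, False if the file was reused.
    """
    if os.path.isfile(filename):
        return False  # reuse the copy already on disk

    handle = open_handle()
    partial = filename + ".part"  # assumption: temporary name of our choosing
    try:
        # Write under a temporary name first, so an interrupted download
        # does not leave a truncated file under the final name.
        with open(partial, "w") as out:
            shutil.copyfileobj(handle, out)
    finally:
        handle.close()
    os.replace(partial, filename)
    return True


def fetch_and_parse(record_id, filename, email):
    """Download one GenBank record to disk (once), then parse the local file."""
    from Bio import Entrez, SeqIO  # requires Biopython

    Entrez.email = email  # the NCBI asks for a contact address

    def open_handle():
        # db="nucleotide" is an assumption; use whatever your script needs.
        return Entrez.efetch(db="nucleotide", id=record_id,
                             rettype="gb", retmode="text")

    download_once(filename, open_handle)
    return SeqIO.read(filename, "genbank")
```

Calling fetch_and_parse a second time with the same filename skips the
download and just parses the saved file.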

That being said, downloading and parsing in one go should work - I
would expect an IO error if the network timed out, rather than what
appears to be the data ending prematurely.  However, I don't expect
this to be easy to resolve - quite possibly it is a network timeout
somewhere, maybe at your end, maybe on one of the ISP connections in
between.

On the bright side, at least the parser isn't silently ignoring the
end of the file, which would leave you with a truncated sequence
without any warnings :)
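
Since a complete GenBank flat file ends with a "//" line, a saved
download can also be given a quick truncation check before parsing.
A minimal sketch follows; the function name and the 1024-byte tail
size are hypothetical choices, and it only inspects the end of the
file, so it is no substitute for the parser's own checks.

```python
import os


def ends_with_terminator(filename, tail_bytes=1024):
    """Return True if the last non-blank line of the file is '//'."""
    size = os.path.getsize(filename)
    with open(filename, "rb") as handle:
        # Only read the tail, so this stays cheap for very large files.
        handle.seek(max(0, size - tail_bytes))
        tail = handle.read().decode("ascii", errors="replace")
    lines = [line.strip() for line in tail.splitlines() if line.strip()]
    return bool(lines) and lines[-1] == "//"
```

A False result on a freshly downloaded file suggests the transfer was
cut short and the file should be fetched again.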

Do you think the Biopython tutorial should be more explicit about this
topic?  For example, in chapter 4 (on Bio.SeqIO) I wrote:

>> Note that just because you can download sequence data and
>> parse it into a SeqRecord object in one go doesn't mean this
>> is always a good idea. In general, you should probably download
>> sequences once and save them to a file for reuse.

Maybe I should have said "... doesn't mean this is a good idea..." instead?

Peter


