[BioPython] Entrez.efetch large files

Thu Oct 9 13:01:11 UTC 2008

Hi Peter,

On 08.10.2008 at 22:57, Peter wrote:

> I'm curious - do you have any numbers for the relative times to load a
> SeqRecord from a pickle, or re-parse it from the GenBank file?  I'm
> aware of some "hot spots" in the GenBank parser which take more time
> than they really need to (feature location parsing in particular).

So, here is a little profiling of reading a large chromosome both as
GenBank and from a pickled SeqRecord (both from disk, of course):

>>> from timeit import Timer
>>> t = Timer("a = cPickle.load(open('DroMel_chr2L.pickle'))", "import cPickle")
>>> t.timeit(number=1)
5.2086620330810547
>>> t = Timer("a = SeqIO.read(open('DroMel_chr2L.gbk'), 'genbank')", "from Bio import SeqIO")
>>> t.timeit(number=1)
53.902437925338745

As you can see, loading with cPickle is roughly ten times faster than
SeqIO.read() (about 5.2 s versus 53.9 s) ... not bad! The pickled file
is a bit larger than the GenBank file, but not by much.
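
For anyone who wants to repeat this kind of measurement on Python 3, here
is a minimal sketch. It is not the session above: there is no cPickle
module in Python 3 (the plain pickle module uses the C implementation
automatically), and the tab-separated file and parse_text() are
hypothetical stand-ins for the GenBank file and SeqIO.read(), so the
numbers it prints say nothing about Biopython itself.

```python
import os
import pickle
import tempfile
import timeit


def time_once(func):
    """Return the wall-clock seconds for a single call of func."""
    return timeit.Timer(func).timeit(number=1)


# Build a small stand-in data file (hypothetical, not a GenBank file).
workdir = tempfile.mkdtemp()
text_path = os.path.join(workdir, "demo.tsv")
pickle_path = os.path.join(workdir, "demo.pickle")

with open(text_path, "w") as fh:
    for i in range(20000):
        fh.write(f"{i}\t{i * i}\n")


def parse_text():
    """Stand-in for SeqIO.read(): re-parse the text file from disk."""
    with open(text_path) as fh:
        return [tuple(int(x) for x in line.split()) for line in fh]


# Pickle the parsed result once; pickles must be opened in binary mode.
with open(pickle_path, "wb") as fh:
    pickle.dump(parse_text(), fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_pickle():
    """Load the already-parsed object back from the pickle."""
    with open(pickle_path, "rb") as fh:
        return pickle.load(fh)


t_parse = time_once(parse_text)
t_pickle = time_once(load_pickle)
print(f"re-parse: {t_parse:.4f} s, pickle load: {t_pickle:.4f} s")
```

Passing callables to Timer instead of statement strings avoids the
separate setup string used in the session above.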

> However, even if using pickles is much faster, I would personally
> still rather use this approach:
>
> if file not present:
>    download from NCBI and save it
> parse file
>
That's precisely how I do it now. It works well!
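
The download-once-then-parse pattern quoted above could be sketched like
this in Python 3. The helper name ensure_local() and the fetch callable
are hypothetical; the sketch assumes fetch() returns a text-mode
file-like handle. The data is copied to disk completely before anything
is parsed, so the parser never reads from the live network connection.

```python
import io
import os
import shutil
import tempfile


def ensure_local(path, fetch):
    """Download to path only if it is not already there; return path.

    fetch is a hypothetical zero-argument callable returning a text-mode
    file-like handle. With Biopython it could wrap something like
    Entrez.efetch(db="nucleotide", id=..., rettype="gb", retmode="text"),
    assuming that call yields a text handle in your version.
    """
    if not os.path.exists(path):
        partial = path + ".part"
        with fetch() as remote, open(partial, "w") as out:
            shutil.copyfileobj(remote, out)
        # Rename only after the copy finished, so an interrupted
        # download never leaves a truncated file under the final name.
        os.replace(partial, path)
    return path


# Demo with a fake download that counts how often it is called.
calls = []


def fake_fetch():
    calls.append(1)
    return io.StringIO("LOCUS       DEMO\n//\n")


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "demo.gbk")
local = ensure_local(target, fake_fetch)  # downloads
local = ensure_local(target, fake_fetch)  # reuses the file on disk
```

After this, the parse step would simply open the local file, for example
with SeqIO.read(local, "genbank").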

> I think it is safer to keep the original data in the NCBI provided
> format, rather than as a python pickle.  Some of my reasons include:
>
> * you might want to parse the files with a different tool one day
> (e.g. grep, or maybe BioPerl, or EMBOSS)
> * different versions of Biopython will parse the file slightly
> differently (e.g. once Bugs 2225 and 2578 are fixed the SeqRecord
> should include slightly more information from a GenBank file) while
> your pickle will be static
> * if the SeqRecord or Seq objects themselves change slightly between
> versions of Biopython, the pickle may not work
> * more generally, is it safe to transfer the pickle files between
> different computers (e.g. different versions of python or Biopython,
> different OS, different line endings)?
>
> These issues may not be a problem in your setting.

You are right, and in fact I now save both the GenBank file and the
pickled file to disk, so I have a full backup.
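
One way to combine the two files is to treat the pickle purely as a
cache that can be rebuilt from the original at any time. A minimal
Python 3 sketch follows; load_with_pickle_cache() and the toy parser are
hypothetical names, and with Biopython the parse argument would be
something like lambda p: SeqIO.read(p, "genbank"). If the pickle is
missing, older than the source file, or fails to load (for example after
a Python or Biopython upgrade), the original file is parsed again.

```python
import os
import pickle
import tempfile


def load_with_pickle_cache(source_path, parse, cache_path=None):
    """Return parse(source_path), using a pickle next to it as a cache."""
    cache_path = cache_path or source_path + ".pickle"
    cache_is_fresh = (
        os.path.exists(cache_path)
        and os.path.getmtime(cache_path) >= os.path.getmtime(source_path)
    )
    if cache_is_fresh:
        try:
            with open(cache_path, "rb") as fh:
                return pickle.load(fh)
        except Exception:
            pass  # unreadable or incompatible pickle: re-parse below
    record = parse(source_path)
    with open(cache_path, "wb") as fh:
        pickle.dump(record, fh, protocol=pickle.HIGHEST_PROTOCOL)
    return record


# Demo with a toy parser that records how often it runs.
parse_calls = []


def toy_parse(path):
    parse_calls.append(path)
    with open(path) as fh:
        return fh.read().split()


workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "demo.gbk")
with open(src, "w") as fh:
    fh.write("LOCUS DEMO 42 bp")

first = load_with_pickle_cache(src, toy_parse)   # parses, writes pickle
second = load_with_pickle_cache(src, toy_parse)  # loads from pickle
```

Only load pickles you wrote yourself; unpickling data from an untrusted
source can execute arbitrary code.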

>
> More generally, you could consider using BioSQL, but this may be
> overkill for your needs.
>
BioSQL is something that I like a lot. I have not yet dug my way
through it, but hopefully there will be options for me from that side
as well.

>> However, as you pointed out, parsing from the internet causes
>> problems.
>
> If you do work out exactly what is going wrong, I would be interested
> to hear about it.
>
Hmm, I probably won't find out. Parsing from the internet works for
small files, so it must be some network issue, but I don't know. Since
I am on the university network I doubt that the error starts on my
side. Maybe NCBI closes the connection if the other side reads too
slowly, which is the case while parsing... But I understand too little
about networking.

>> I think the advantages of not having to download each time were
>> clear to me from the tutorial. It just didn't occur to me that
>> downloading AND parsing at the same time causes problems. The
>> additions to the tutorial seem to give some idea.
>
> Your approach all makes sense. Thanks for explaining your thoughts.  I
> don't think I'd ever tried efetch on such a large GenBank file in the
> first place - for genomes I have usually used FTP instead.
>
> Peter

Regards,
Stephan