[Biopython] History, Efetch, and returned records limits

Mon Apr 16 14:12:56 UTC 2012

Yeah, if you run a retrieval in batches, you need a step to rerun the request in case it fails, particularly if the request is made at a busy time. We do the same with BioPerl's interface, very similar to what Peter suggests.
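
Roughly, the rerun step could look like the sketch below. It is not taken from either toolkit: the names fetch_with_retry, fetch_batch and is_valid are made up for illustration, and it assumes that network failures surface as IOError and that a bad response can be spotted by checking the returned text (for FASTA, whether it starts with ">").

```python
import time


def fetch_with_retry(fetch_batch, is_valid, attempts=3, delay=5.0):
    """Call fetch_batch() until is_valid(data) is true or attempts run out.

    fetch_batch and is_valid are caller-supplied callables (hypothetical
    names): the first downloads one batch and returns its text, the second
    checks whether that text looks like a real result.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            data = fetch_batch()
            if is_valid(data):
                return data
            # The server answered, but not with the records we asked for.
            last_error = ValueError("unexpected response: %r" % data[:80])
        except IOError as err:
            # Assumed here: network-level failures are raised as IOError.
            last_error = err
        if attempt < attempts:
            time.sleep(delay * attempt)  # wait a little longer each time
    raise RuntimeError("batch failed after %i attempts: %s"
                       % (attempts, last_error))


# Stand-in for a real download: the first reply is an error page,
# the second is FASTA.
replies = iter(["<ERROR>Unable to obtain query #1</ERROR>", ">gi|1\nACGT\n"])
fasta = fetch_with_retry(lambda: next(replies),
                         lambda text: text.startswith(">"),
                         attempts=3, delay=0)
```

In a script like Peter's, fetch_batch would wrap the epost+efetch calls for one batch, so a failed batch is reposted and fetched again without touching the batches already written.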

chris

On Apr 16, 2012, at 9:04 AM, Mariam Reyad Rizkallah wrote:

> OH WOW!
> 
> It works like a charm! Peter, thank you very much for your insight and for taking
> the time to fix my script.
> 
> I do appreciate it. Thank you.
> 
> Mariam
> Blog post here:
> http://opensourcepharmacist.wordpress.com/2012/04/16/the-community-biopython-overcomes-limitations/
> 
> 
> On Sat, Apr 14, 2012 at 9:32 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> 
>> On Sat, Apr 14, 2012 at 6:39 PM, Peter Cock <p.j.a.cock at googlemail.com>
>> wrote:
>>> 
>>> A somewhat brute force approach would be to do the
>>> search (don't bother with the history) and get the 10313
>>> GI numbers. Then use epost+efetch to grab the records
>>> in batches of say 1000.
>>> 
>> 
>> That does work (see below), but not all the time. A potential
>> advantage of this way is that each fetch batch is a separate
>> session, so retrying it should be straightforward.
>> 
>> Peter
>> 
>> #!/usr/bin/python
>> import sys
>> from Bio import Entrez
>> Entrez.email = "mariam.rizkallah at gmail.com"
>> Entrez.email = "p.j.a.cock at googlemail.com"
>> txid = 543769
>> name = "Rhizaria"
>> 
>> search_handle = Entrez.esearch(db="nucest", term="txid%s[Organism:exp]" % txid,
>>                               retmax="20000")
>> search_results = Entrez.read(search_handle)
>> search_handle.close()
>> gi_list = search_results["IdList"]
>> count = int(search_results["Count"])
>> print(count)
>> assert count == len(gi_list), len(gi_list)
>> 
>> out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count)
>> out_handle = open(out_fasta, "a")
>> 
>> ## Approach1: gets <XML> tags within the fasta file
>> ## <ERROR>Unable to obtain query #1</ERROR>
>> batch_size = 1000
>> for start in range(0, count, batch_size):
>>     end = min(count, start + batch_size)
>>     batch = gi_list[start:end]
>>     print("Going to download record %i to %i using epost+efetch"
>>           % (start + 1, end))
>>     # Post this batch's GI numbers to get its own WebEnv/QueryKey session
>>     post_results = Entrez.read(Entrez.epost("nucest", id=",".join(batch)))
>>     webenv = post_results["WebEnv"]
>>     query_key = post_results["QueryKey"]
>>     fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", retmode="text",
>>                                  webenv=webenv, query_key=query_key)
>>     data = fetch_handle.read()
>>     # A failed fetch can return XML (e.g. <ERROR>...) instead of FASTA
>>     assert data.startswith(">"), data
>>     fetch_handle.close()
>>     out_handle.write(data)
>> print("Done")
>> out_handle.close()
>> 
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython