[Biopython] History, Efetch, and returned records limits

Mon Apr 16 15:51:19 UTC 2012

Peter, Mariam,

Turns out they do document this:

   http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.retmax_3

I can confirm this as well: I just ran a quick test locally with a simple script that retrieves a set of protein samples. The esearch count was 27382, but the retrieved set maxed out at exactly 10,000 records.

[cjfields at pyrimidine-laptop eutils]$ perl limit_test.pl 
27382
[cjfields at pyrimidine-laptop eutils]$ grep -c '^>' seqs.aa 
10000
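
For what it's worth, the documented way around the cap is to page through the result set with retstart and retmax. Below is a minimal Python sketch of just the window arithmetic. The helper name batch_windows is made up for illustration, and the commented-out efetch call assumes a history session (webenv, query_key) from an earlier esearch plus the "protein" database, neither of which is shown here.

```python
def batch_windows(count, batch_size=10000):
    """Yield (retstart, retmax) pairs that together cover `count` records.

    batch_size defaults to the 10,000-record cap observed above.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for retstart in range(0, count, batch_size):
        # The last window is shortened so it does not run past `count`.
        yield retstart, min(batch_size, count - retstart)


# 27382 is the esearch count from the test above.
for retstart, retmax in batch_windows(27382):
    print(retstart, retmax)
    # Assumed usage (not run here), given webenv/query_key from esearch:
    # handle = Entrez.efetch(db="protein", rettype="fasta", retmode="text",
    #                        retstart=retstart, retmax=retmax,
    #                        webenv=webenv, query_key=query_key)
```

With a count of 27382 this gives three windows, the last one covering the remaining 7382 records.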

Not sure if there are similar constraints using NCBI's SOAP interface, but I wouldn't be surprised.
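
On the retry step mentioned further down the thread: here is a rough Python sketch of what I mean, not the actual BioPerl or Biopython code. The function name fetch_with_retry, the attempt count, and the delay are all assumptions for illustration. The fetch argument is any zero-argument callable that returns the text of one batch, and the check for a leading ">" assumes FASTA output.

```python
import time


def fetch_with_retry(fetch, attempts=3, delay=1.0):
    """Call fetch() until it returns FASTA text, at most `attempts` times."""
    last_problem = None
    for attempt in range(1, attempts + 1):
        try:
            data = fetch()
            if data.startswith(">"):
                return data
            # e.g. an XML <ERROR> page returned in place of sequences
            last_problem = "non-FASTA reply: %r" % data[:60]
        except IOError as err:
            # Network errors raised by urllib are IOError (OSError) subclasses.
            last_problem = str(err)
        if attempt < attempts:
            # Wait a little longer after each failure before trying again.
            time.sleep(delay * attempt)
    raise RuntimeError("batch failed after %i attempts (%s)"
                       % (attempts, last_problem))
```

Each batch would be wrapped in its own call, so one bad reply only costs a repeat of that batch rather than the whole download.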

chris

On Apr 16, 2012, at 9:12 AM, Fields, Christopher J wrote:

> Yeah, if you run a retrieval in batches you need a step to rerun the request in case it fails, particularly if the request is occurring at a busy time.  We do the same with bioperl's interface, very similar to what Peter suggests.
> 
> chris
> 
> On Apr 16, 2012, at 9:04 AM, Mariam Reyad Rizkallah wrote:
> 
>> OH WOW!
>> 
>> It works like a charm! Peter, thank you very much for your insight and for
>> taking the time to fix my script.
>> 
>> I do appreciate it. Thank you.
>> 
>> Mariam
>> Blog post here:
>> http://opensourcepharmacist.wordpress.com/2012/04/16/the-community-biopython-overcomes-limitations/
>> 
>> 
>> On Sat, Apr 14, 2012 at 9:32 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> 
>>> On Sat, Apr 14, 2012 at 6:39 PM, Peter Cock <p.j.a.cock at googlemail.com>
>>> wrote:
>>>> 
>>>> A somewhat brute force approach would be to do the
>>>> search (don't bother with the history) and get the 10313
>>>> GI numbers. Then use epost+efetch to grab the records
>>>> in batches of say 1000.
>>>> 
>>> 
>>> That does work (see below), but not all the time. A potential
>>> advantage of this way is that each fetch batch is a separate
>>> session, so retrying it should be straightforward.
>>> 
>>> Peter
>>> 
>>> #!/usr/bin/python
>>> import sys
>>> from Bio import Entrez
>>> Entrez.email = "p.j.a.cock at googlemail.com"  # contact address sent to NCBI
>>> txid = 543769
>>> name = "Rhizaria"
>>> 
>>> search_handle = Entrez.esearch(db="nucest",
>>>                                term="txid%s[Organism:exp]" % txid,
>>>                                retmax="20000")
>>> search_results = Entrez.read(search_handle)
>>> search_handle.close()
>>> gi_list = search_results["IdList"]
>>> count = int(search_results["Count"])
>>> print(count)
>>> assert count == len(gi_list), len(gi_list)
>>> 
>>> out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count)
>>> out_handle = open(out_fasta, "a")
>>> 
>>> ## Approach1: gets <XML> tags within the fasta file
>>> ## <ERROR>Unable to obtain query #1</ERROR>
>>> batch_size = 1000
>>> for start in range(0, count, batch_size):
>>>     end = min(count, start + batch_size)
>>>     batch = gi_list[start:end]
>>>     print("Going to download record %i to %i using epost+efetch"
>>>           % (start + 1, end))
>>>     # Post this batch of GI numbers; each batch gets its own history session
>>>     post_results = Entrez.read(Entrez.epost("nucest", id=",".join(batch)))
>>>     webenv = post_results["WebEnv"]
>>>     query_key = post_results["QueryKey"]
>>>     fetch_handle = Entrez.efetch(db="nucest", rettype="fasta",
>>>                                  retmode="text", webenv=webenv,
>>>                                  query_key=query_key)
>>>     data = fetch_handle.read()
>>>     # An XML error page instead of FASTA means this batch failed
>>>     assert data.startswith(">"), data
>>>     fetch_handle.close()
>>>     out_handle.write(data)
>>> print("Done")
>>> out_handle.close()
>>> 
>> _______________________________________________
>> Biopython mailing list  -  Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython