[Biopython] History, Efetch, and returned records limits

Mon Apr 16 16:19:31 UTC 2012

Oh! Thank you, Chris. So there IS a limit!
I emailed NCBI asking whether there is a limit on record retrieval. They
replied that the appropriate way is to do batch retrieval, without
emphasizing any limit.
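
For the archive, here is a minimal sketch of how batch retrieval could be laid out under the 10K cap that Chris measured. The helper name batch_windows is made up for illustration and is not part of Biopython; it only computes (retstart, retmax) pairs and makes no network calls. The commented-out efetch call assumes the search was run with usehistory="y", so that webenv and query_key are available.

```python
def batch_windows(count, batch_size=1000):
    """Yield (retstart, retmax) pairs that together cover `count` records."""
    for retstart in range(0, count, batch_size):
        # The last window is shorter when count is not a multiple of batch_size
        yield retstart, min(batch_size, count - retstart)


# Chris's test case: 27382 hits, at most 10000 records per request
windows = list(batch_windows(27382, batch_size=10000))
print(windows)  # [(0, 10000), (10000, 10000), (20000, 7382)]

# Each window would then drive one fetch, roughly like this (not run here):
# for retstart, retmax in windows:
#     handle = Entrez.efetch(db="protein", rettype="fasta", retmode="text",
#                            retstart=retstart, retmax=retmax,
#                            webenv=webenv, query_key=query_key)
```

Smaller batches (say 1000, as Peter suggests) mean more requests, but less is lost when a single request fails.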

Thank you.

Mariam
On Apr 16, 2012 5:51 PM, "Fields, Christopher J" <cjfields at illinois.edu>
wrote:

> Peter, Mariam,
>
> Turns out they do document this:
>
>   http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.retmax_3
>
> I can also confirm this, just ran a quick test locally with a simple
> script to retrieve a set of protein samples.  The esearch count was 27382,
> but the retrieved set maxed out at 10K exactly.
>
> [cjfields at pyrimidine-laptop eutils]$ perl limit_test.pl
> 27382
> [cjfields at pyrimidine-laptop eutils]$ grep -c '^>' seqs.aa
> 10000
>
> Not sure if there are similar constraints using NCBI's SOAP interface, but
> I wouldn't be surprised.
>
> chris
>
> On Apr 16, 2012, at 9:12 AM, Fields, Christopher J wrote:
>
> > Yeah, if you run a retrieval in batches you need a step to rerun the
> > request in case it fails, particularly if the request is occurring at a
> > busy time.  We do the same with bioperl's interface, very similar to what
> > Peter suggests.
> >
> > chris
> >
> > On Apr 16, 2012, at 9:04 AM, Mariam Reyad Rizkallah wrote:
> >
> >> OH WOW!
> >>
> >> It works like a charm! Peter, thank you very much for your insight and
> >> for taking the time to fix my script.
> >>
> >> I do appreciate it. Thank you.
> >>
> >> Mariam
> >> Blog post here:
> >>
> >> http://opensourcepharmacist.wordpress.com/2012/04/16/the-community-biopython-overcomes-limitations/
> >>
> >>
> >> On Sat, Apr 14, 2012 at 9:32 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> >>
> >>> On Sat, Apr 14, 2012 at 6:39 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> >>>>
> >>>> A somewhat brute force approach would be to do the
> >>>> search (don't bother with the history) and get the 10313
> >>>> GI numbers. Then use epost+efetch to grab the records
> >>>> in batches of say 1000.
> >>>>
> >>>
> >>> That does work (see below), but not all the time. A potential
> >>> advantage of this way is that each fetch batch is a separate
> >>> session, so retrying it should be straightforward.
> >>>
> >>> Peter
> >>>
> >>> #!/usr/bin/python
> >>> import sys
> >>> from Bio import Entrez
> >>> # Tell NCBI who is making the requests (use your own address here)
> >>> Entrez.email = "p.j.a.cock at googlemail.com"
> >>> txid = 543769
> >>> name = "Rhizaria"
> >>>
> >>> search_handle = Entrez.esearch(db="nucest",
> >>>                               term="txid%s[Organism:exp]" % txid,
> >>>                               retmax="20000")
> >>> search_results = Entrez.read(search_handle)
> >>> search_handle.close()
> >>> gi_list = search_results["IdList"]
> >>> count = int(search_results["Count"])
> >>> print(count)
> >>> assert count == len(gi_list), len(gi_list)
> >>>
> >>> out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count)
> >>> out_handle = open(out_fasta, "a")
> >>>
> >>> ## Approach1: gets <XML> tags within the fasta file
> >>> ## <ERROR>Unable to obtain query #1</ERROR>
> >>> batch_size = 1000
> >>> for start in range(0, count, batch_size):
> >>>     end = min(count, start + batch_size)
> >>>     batch = gi_list[start:end]
> >>>     print("Going to download record %i to %i using epost+efetch"
> >>>           % (start + 1, end))
> >>>     # Post this batch of GI numbers; each batch gets its own history session
> >>>     post_results = Entrez.read(Entrez.epost("nucest", id=",".join(batch)))
> >>>     webenv = post_results["WebEnv"]
> >>>     query_key = post_results["QueryKey"]
> >>>     fetch_handle = Entrez.efetch(db="nucest", rettype="fasta",
> >>>                                  retmode="text", webenv=webenv,
> >>>                                  query_key=query_key)
> >>>     data = fetch_handle.read()
> >>>     # A failed fetch comes back as an XML error rather than FASTA
> >>>     assert data.startswith(">"), data
> >>>     fetch_handle.close()
> >>>     out_handle.write(data)
> >>> print("Done")
> >>> out_handle.close()
> >>>
> >> _______________________________________________
> >> Biopython mailing list  -  Biopython at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/biopython
> >
> >
>
>
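
Chris's point about rerunning a failed batch, together with the <ERROR> text that ended up in the FASTA file, suggests wrapping each fetch in a retry step. What follows is only a minimal sketch under stated assumptions: fetch_with_retry and looks_like_fasta are hypothetical helpers, not Biopython functions, and fetch stands for any zero-argument callable that returns the response text (for example a small function around the epost+efetch pair in Peter's loop). The demo uses canned responses instead of contacting NCBI.

```python
import time


def looks_like_fasta(data):
    """True if the text starts like FASTA and carries no E-utilities error tag."""
    return data.startswith(">") and "<ERROR>" not in data


def fetch_with_retry(fetch, attempts=3, delay=5.0):
    """Call fetch() until it returns FASTA text, at most `attempts` times."""
    last_problem = None
    for attempt in range(1, attempts + 1):
        try:
            data = fetch()
        except IOError as err:  # network-level failure
            last_problem = err
        else:
            if looks_like_fasta(data):
                return data
            last_problem = ValueError("response is not FASTA: %r" % data[:60])
        if attempt < attempts:
            # Wait a little longer after each failure before trying again
            time.sleep(delay * attempt)
    raise RuntimeError("giving up after %d attempts: %s"
                       % (attempts, last_problem))


# Demo with canned responses: the first one fails, the second one succeeds
responses = iter(["<ERROR>Unable to obtain query #1</ERROR>", ">seq1\nACGT\n"])
result = fetch_with_retry(lambda: next(responses), attempts=3, delay=0)
print(result)
```

Because each epost+efetch batch is its own session, retrying one batch does not disturb the others. Comparing the number of ">" lines written with the esearch count, as Chris did with grep, remains a useful final check.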