[Biopython] History, Efetch, and returned records limits

Mariam Reyad Rizkallah mrrizkalla at gmail.com
Mon Apr 16 14:04:08 UTC 2012


OH WOW!

It works like a charm! Peter, thank you very much for your insight and for
taking the time to fix my script.

I do appreciate it. Thank you.

Mariam
Blog post here:
http://opensourcepharmacist.wordpress.com/2012/04/16/the-community-biopython-overcomes-limitations/


On Sat, Apr 14, 2012 at 9:32 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Sat, Apr 14, 2012 at 6:39 PM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
> >
> > A somewhat brute force approach would be to do the
> > search (don't bother with the history) and get the 10313
> > GI numbers. Then use epost+efetch to grab the records
> > in batches of say 1000.
> >
>
> That does work (see below), but not all the time. A potential
> advantage of this approach is that each fetch batch is a separate
> session, so retrying it should be straightforward.
>
> Peter
>
> #!/usr/bin/python
> import sys
> from Bio import Entrez
> Entrez.email = "mariam.rizkallah at gmail.com"
> Entrez.email = "p.j.a.cock at googlemail.com"
> txid = 543769
> name = "Rhizaria"
>
> search_handle = Entrez.esearch(db="nucest", term="txid%s[Organism:exp]" % txid,
>                                retmax="20000")
> search_results = Entrez.read(search_handle)
> search_handle.close()
> gi_list = search_results["IdList"]
> count = int(search_results["Count"])
> print(count)
> assert count == len(gi_list), len(gi_list)
>
> out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count)
> out_handle = open(out_fasta, "a")
>
> ## Approach1: gets <XML> tags within the fasta file:
> ## <ERROR>Unable to obtain query #1</ERROR>
> batch_size = 1000
> for start in range(0, count, batch_size):
>     end = min(count, start + batch_size)
>     batch = gi_list[start:end]
>     print("Going to download record %i to %i using epost+efetch" % (start + 1, end))
>     # Upload this batch of GI numbers, then fetch it via the returned history session
>     post_results = Entrez.read(Entrez.epost("nucest", id=",".join(batch)))
>     webenv = post_results["WebEnv"]
>     query_key = post_results["QueryKey"]
>     fetch_handle = Entrez.efetch(db="nucest", rettype="fasta",
>                                  retmode="text", webenv=webenv, query_key=query_key)
>     data = fetch_handle.read()
>     # Fail loudly if NCBI returned something other than FASTA (e.g. an XML error)
>     assert data.startswith(">"), data
>     fetch_handle.close()
>     out_handle.write(data)
> print("Done")
> out_handle.close()
>
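
On the point in the quoted message that each batch is a separate session and
so should be easy to retry: a minimal sketch of what that could look like.
It assumes the epost+efetch calls for one batch are wrapped in a
zero-argument function. The names fetch_with_retries, fetch_batch, max_tries
and delay are hypothetical and not part of Biopython. A stand-in function is
used here instead of a real download so the example runs without network
access.

```python
import time


def fetch_with_retries(fetch_batch, max_tries=3, delay=1.0):
    """Call fetch_batch() until it returns FASTA text, at most max_tries times.

    fetch_batch is a hypothetical zero-argument callable that would wrap the
    epost+efetch calls for one batch and return the downloaded text.
    """
    last_problem = None
    for attempt in range(1, max_tries + 1):
        try:
            data = fetch_batch()
        except IOError as err:
            # Network-level failure: remember it and try again.
            last_problem = err
        else:
            if data.startswith(">"):
                return data
            # Something other than FASTA came back, e.g. an XML <ERROR> block.
            last_problem = ValueError("not FASTA: %r" % data[:60])
        if attempt < max_tries:
            time.sleep(delay)
    raise RuntimeError("batch failed after %i tries: %s" % (max_tries, last_problem))


# Stand-in for a real batch download: fails once, then succeeds.
responses = iter(["<ERROR>Unable to obtain query #1</ERROR>", ">seq1\nACGT\n"])
data = fetch_with_retries(lambda: next(responses), max_tries=3, delay=0)
print(data.splitlines()[0])
```

In the script above, the body of the loop (from the epost call to the read)
would become the fetch_batch function for each batch, and the assert would no
longer be needed because the check happens inside the retry helper.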


