[Biopython] History, Efetch, and returned records limits

Peter Cock p.j.a.cock at googlemail.com
Sat Apr 14 19:32:03 UTC 2012


On Sat, Apr 14, 2012 at 6:39 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
> A somewhat brute force approach would be to do the
> search (don't bother with the history) and get the 10313
> GI numbers. Then use epost+efetch to grab the records
> in batches of say 1000.
>

That does work (see below), though not every time. A potential
advantage of this approach is that each fetch batch is a separate
epost session, so retrying a failed batch should be straightforward.

Peter

#!/usr/bin/python
import sys
from Bio import Entrez
Entrez.email = "mariam.rizkallah at gmail.com"
Entrez.email = "p.j.a.cock at googlemail.com"
txid = 543769
name = "Rhizaria"

search_handle = Entrez.esearch(db="nucest", term="txid%s[Organism:exp]" % txid,
                               retmax="20000")
search_results = Entrez.read(search_handle)
search_handle.close()
gi_list = search_results["IdList"]
count = int(search_results["Count"])
print count
assert count == len(gi_list), len(gi_list)

out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count)
out_handle = open(out_fasta, "a")

out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count)
out_handle = open(out_fasta, "a")

## Approach 1 sometimes got <XML> error tags mixed into the FASTA file,
## e.g. <ERROR>Unable to obtain query #1</ERROR>
batch_size = 1000
for start in range(0,count,batch_size):
    end = min(count, start+batch_size)
    batch = gi_list[start:end]
    print "Going to download record %i to %i using epost+efetch" %
(start+1, end)
    post_results = Entrez.read(Entrez.epost("nucest", id=",".join(batch)))
    webenv = post_results["WebEnv"]
    query_key = post_results["QueryKey"]
    fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", retmode="text",
                                 webenv=webenv, query_key=query_key)
    data = fetch_handle.read()
    assert data.startswith(">"), data
    fetch_handle.close()
    out_handle.write(data)
print "Done"
out_handle.close()
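
Since each batch is its own epost session, a failed batch can simply be
re-posted and re-fetched. Here is a rough sketch of how that retry could
look; the helper name, retry count, and sleep interval are my own choices
and not part of the script above:

import time
from urllib2 import HTTPError  # urllib.error.HTTPError on Python 3
from Bio import Entrez
# Entrez.email should be set as in the script above

def fetch_batch(batch, retries=3):
    """epost one batch of GI numbers, then efetch it as FASTA, retrying on failure."""
    for attempt in range(retries):
        try:
            post_results = Entrez.read(Entrez.epost("nucest", id=",".join(batch)))
            fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", retmode="text",
                                         webenv=post_results["WebEnv"],
                                         query_key=post_results["QueryKey"])
            data = fetch_handle.read()
            fetch_handle.close()
            if data.startswith(">"):
                return data
            print "Attempt %i returned unexpected data, retrying" % (attempt+1)
        except HTTPError as err:
            print "Attempt %i failed: %s" % (attempt+1, err)
        time.sleep(10)  # arbitrary pause before retrying
    raise RuntimeError("Batch still failing after %i attempts" % retries)

The loop in the script above would then just call data = fetch_batch(batch)
and write the result to out_handle.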


