[Biopython] History, Efetch, and returned records limits

Peter Cock p.j.a.cock at googlemail.com
Sat Apr 14 17:39:22 UTC 2012


Hi again,

I get a similar problem with this code - on the first couple of
tries it fetched the first 5000 records and then failed, but it
doesn't always fail at the same point:

$ python mariam.py
10313
Going to download record 1 to 1000
Going to download record 1001 to 2000
Going to download record 2001 to 3000
Traceback (most recent call last):
  File "mariam.py", line 28, in <module>
    assert data.startswith(">"), data
AssertionError: <?xml version="1.0" encoding="UTF-8"?>

<eFetchResult>
	<ERROR>Unable to obtain query #1</ERROR>
</eFetchResult>

Sometimes it gets further:

$ python mariam.py
10313
Going to download record 1 to 1000
Going to download record 1001 to 2000
Going to download record 2001 to 3000
Going to download record 3001 to 4000
Going to download record 4001 to 5000
Going to download record 5001 to 6000
Going to download record 6001 to 7000
Going to download record 7001 to 8000
Going to download record 8001 to 9000
Going to download record 9001 to 10000
Going to download record 10001 to 10313
Traceback (most recent call last):
  File "mariam.py", line 28, in <module>
    assert data.startswith(">"), data
AssertionError: <?xml version="1.0" encoding="UTF-8"?>

<eFetchResult>
	<ERROR>Unable to obtain query #1</ERROR>
</eFetchResult>


Notice that this demonstrates one of the major flaws with the
current NCBI Entrez setup - rather than returning an HTTP error
code (which would trigger a clear exception), Entrez returns
HTTP 200 OK but puts an error message in an XML payload
(essentially a silent error). This is most unhelpful IMO.

(This is something TogoWS handles much more nicely).
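
For what it's worth, a defensive check along these lines would turn
the silent error into a clear exception. This is just a sketch for
use right after the fetch_handle.read() call in the script below,
assuming any payload starting with "<?xml" and containing an
<ERROR> tag is one of these silent failures:

data = fetch_handle.read()
if data.startswith("<?xml") and "<ERROR>" in data:
    # Entrez said HTTP 200 OK, but the payload is an error report
    raise ValueError("Entrez returned an error payload:\n" + data)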


#!/usr/bin/python
import sys
from Bio import Entrez
Entrez.email = "mariam.rizkallah at gmail.com"
#Entrez.email = "p.j.a.cock at googlemail.com"
txid = 543769
name = "Rhizaria"

#using history
search_handle = Entrez.esearch(db="nucest",
                               term="txid%s[Organism:exp]" % txid,
                               usehistory="y")
search_results = Entrez.read(search_handle)
search_handle.close()
gi_list = search_results["IdList"]
count = int(search_results["Count"])
print count
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]
out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count)
out_handle = open(out_fasta, "a")

#Sometimes get XML error not FASTA
batch_size = 1000
for start in range(0,count,batch_size):
    end = min(count, start+batch_size)
    print "Going to download record %i to %i" % (start+1, end)
    fetch_handle = Entrez.efetch(db="nucest", rettype="fasta",
                                 retmode="text", retstart=start,
                                 retmax=batch_size, webenv=webenv,
                                 query_key=query_key)
    data = fetch_handle.read()
    assert data.startswith(">"), data
    fetch_handle.close()
    out_handle.write(data)
print "Done"
out_handle.close()

This is how I believe the NCBI expects this task to be done.
In this specific case it seems to be a failure at the NCBI's end.
Perhaps a loop to retry the efetch two or three times might
work - see the sketch below. It could be that the whole history
session breaks at the NCBI end though...
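
Something like this untested sketch could replace the efetch and
assert inside the loop above (it reuses start, batch_size, webenv
and query_key from the script, and needs "import time" added at
the top):

for attempt in range(3):
    fetch_handle = Entrez.efetch(db="nucest", rettype="fasta",
                                 retmode="text", retstart=start,
                                 retmax=batch_size, webenv=webenv,
                                 query_key=query_key)
    data = fetch_handle.read()
    fetch_handle.close()
    if data.startswith(">"):
        # Got FASTA as expected
        break
    print "Attempt %i failed, retrying..." % (attempt + 1)
    time.sleep(5)
else:
    raise RuntimeError("Batch failed three times:\n" + data)
out_handle.write(data)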

A somewhat brute force approach would be to do the
search (don't bother with the history) and get the 10313
GI numbers, then use epost+efetch to grab the records
in batches of, say, 1000 - something like the sketch below.
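
Untested, but roughly like this (reusing txid, count and
out_handle from the script above):

search_handle = Entrez.esearch(db="nucest",
                               term="txid%s[Organism:exp]" % txid,
                               retmax=count)
gi_list = Entrez.read(search_handle)["IdList"]
search_handle.close()

batch_size = 1000
for start in range(0, len(gi_list), batch_size):
    batch = gi_list[start:start + batch_size]
    # Post this batch of GI numbers, then fetch them back as FASTA
    post_results = Entrez.read(Entrez.epost("nucest",
                                            id=",".join(batch)))
    fetch_handle = Entrez.efetch(db="nucest", rettype="fasta",
                                 retmode="text",
                                 webenv=post_results["WebEnv"],
                                 query_key=post_results["QueryKey"])
    out_handle.write(fetch_handle.read())
    fetch_handle.close()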

Peter


